Abstract. We observe a random measure N and aim at estimating
its intensity s. This statistical framework allows us to deal simultaneously
with the problems of estimating a density, the marginals of a multivariate
distribution, the mean of a random vector with nonnegative components
and the intensity of a Poisson process. Our estimation strategy
is based on estimator selection. Given a family of estimators of s based
on the observation of N, we propose a selection rule, based on N as
well, for choosing among them. Few assumptions are made on the
collection of estimators. The procedure makes it possible to perform
model selection and also to select among estimators associated with different
model selection strategies. Besides, it provides an alternative to the
T-estimators recently studied in Birgé (2006). For illustration, we
consider the problems of estimation and (complete) variable selection in
various regression settings.



We consider k independent random measures $N_1, \ldots, N_k$ where the $N_i$ are
defined on an abstract probability space $(\Omega, \mathcal{T}, \mathbb{P})$ with values in the class of
positive measures on measured spaces $(\mathcal{X}_i, \mathcal{A}_i, \mu_i)$. We assume that

(1) $\mathbb{E}[N_i(A)] = \int_A s_i \, d\mu_i < +\infty$, for all $A \in \mathcal{A}_i$ and all $i = 1, \ldots, k$

where each $s_i$ is a nonnegative and measurable function on $\mathcal{X}_i$ that we shall
call the intensity of $N_i$. Equality (1) implies that the $N_i$ are a.s. finite
measures and that for all measurable and nonnegative functions $f_i$ on $\mathcal{X}_i$,

(2) $\mathbb{E}\left[\int_{\mathcal{X}_i} f_i \, dN_i\right] = \int_{\mathcal{X}_i} f_i s_i \, d\mu_i$.

Our aim is to estimate $s = (s_1, \ldots, s_k)$ from the observation of $N = (N_1, \ldots, N_k)$.
We shall set $\mathcal{X} = (\mathcal{X}_1, \ldots, \mathcal{X}_k)$, $\mathcal{A} = (\mathcal{A}_1, \ldots, \mathcal{A}_k)$, $\mu = (\mu_1, \ldots, \mu_k)$ and
denote by $\mathcal{C}$ the cone of nonnegative and measurable functions t of the form
$(t_1, \ldots, t_k)$ where the $t_i$ are nonnegative and integrable functions on $(\mathcal{X}_i, \mathcal{A}_i, \mu_i)$.
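For $f = (f_1, \ldots, f_k) \in \mathcal{C}$, we use the notations

$\int f \, dN = \sum_{i=1}^{k} \int_{\mathcal{X}_i} f_i \, dN_i$ and $\int f \, d\mu = \sum_{i=1}^{k} \int_{\mathcal{X}_i} f_i \, d\mu_i$.

Throughout, $\mathcal{C}_0$ denotes a known subset of $\mathcal{C}$ which we assume to contain s.
The statistical framework we have described allows us to deal simultaneously
with the more classical ones given below: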

Example 1 (Density Estimation). Consider the problem of estimating a
density s on $(\mathcal{X}, \mathcal{A}, \mu)$ from the observation of an n-sample $X_1, \ldots, X_n$ with
distribution $P_s = s \, d\mu$. In order to handle this problem, we shall take k = 1,
$\mathcal{C}_0$ the set of densities on $(\mathcal{X}, \mathcal{A})$ with respect to $\mu$ and $N = n^{-1} \sum_{i=1}^{n} \delta_{X_i}$.

Example 2 (Estimation of marginals). Let $X_1, \ldots, X_n$ be independent random
variables with values in the measured spaces $(\mathcal{X}_1, \mathcal{A}_1, \mu_1), \ldots, (\mathcal{X}_n, \mathcal{A}_n, \mu_n)$
respectively. We assume that for all i, $X_i$ admits a density $s_i$ with respect
to $\mu_i$ and our aim is to estimate $s = (s_1, \ldots, s_n)$ from the observation
of $X = (X_1, \ldots, X_n)$. We shall deal with this problem by taking k = n and
$N_i = \delta_{X_i}$ for $i = 1, \ldots, n$. Note that this setting includes as a particular
case that of the regression framework

$X_i = f_i + \varepsilon_i$, $i = 1, \ldots, n$

where the $\varepsilon_i$ are i.i.d. random variables with a known distribution. The
problem of estimating the densities of the $X_i$ then amounts to estimating
the shift parameter $f = (f_1, \ldots, f_n)$.

Example 3 (Estimating the intensity of a Poisson process). Consider the
problem of estimating the intensity s of a possibly inhomogeneous Poisson
process N on a measurable space $(\mathcal{X}, \mathcal{A})$. We shall assume that s is
integrable. This statistical setting is a particular case of our general one,
obtained by taking k = 1 and $\mathcal{C}_0 = \mathcal{C}$.

Other examples will be introduced later on. 

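Throughout, we shall deal with estimators with values in $\mathcal{C}_0$ and, to measure
their risks, endow $\mathcal{C}_0$ with the distance H defined for t, t' in $\mathcal{C}_0$ by

$H^2(t, t') = \frac{1}{2} \int \left(\sqrt{t} - \sqrt{t'}\right)^2 d\mu = \frac{1}{2} \sum_{i=1}^{k} \int_{\mathcal{X}_i} \left(\sqrt{t_i} - \sqrt{t'_i}\right)^2 d\mu_i$.

When k = 1 and t, t' are densities with respect to $\mu$, H is merely the Hellinger
distance between the corresponding probabilities. Given an estimator $\hat{s}$ of s,
i.e. a measurable function of N with $\hat{s} \in \mathcal{C}_0$, we define its risk by $\mathbb{E}[H^2(s, \hat{s})]$.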
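
For intuition, the distance H can be evaluated numerically in the simplest discrete setting. The following is a minimal sketch, assuming k = 1, a finite space with the counting measure as reference measure, and taking $H^2$ to be one half of the integrated squared difference of square roots (the convention that agrees with the relation between $H^2$ and the affinity used in Section 2). The name `hellinger_sq` is illustrative and does not come from the paper.

```python
import math


def hellinger_sq(t, t_prime):
    """Squared distance H^2(t, t') = (1/2) * sum_x (sqrt(t(x)) - sqrt(t'(x)))^2.

    Assumes a finite space with the counting measure as reference measure,
    so that integrals reduce to sums over the points of the space.
    """
    if len(t) != len(t_prime):
        raise ValueError("t and t' must be defined on the same finite space")
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(t, t_prime))


# Two probability vectors (densities with respect to the counting measure).
p = [0.5, 0.5, 0.0]
q = [0.0, 0.0, 1.0]
print(hellinger_sq(p, p))  # identical densities
print(hellinger_sq(p, q))  # densities with disjoint supports
```

With this normalization, two densities with disjoint supports are at squared distance 1, the largest possible value between probability densities.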

Let us now give an account of our estimation strategy. We consider an at
most countable family $\{S_m, m \in \mathcal{M}\}$ of subsets of $\mathcal{C}_0$, that we shall call
models, and a family of positive weights $\{\Delta_m, m \in \mathcal{M}\}$ on these satisfying

$\Sigma = \sum_{m \in \mathcal{M}} e^{-\Delta_m} < +\infty$.

When $\Sigma = 1$, the $\Delta_m$ define a prior distribution on the family of models
and thus give a Bayesian flavor to the procedure. Then, we assume that
we have at our disposal a collection $\{\hat{s}_\lambda, \lambda \in \Lambda\}$ of estimators of s based on N
with values in $\mathbb{S} = \bigcup_{m \in \mathcal{M}} S_m$. By this we mean that each estimator $\hat{s}_\lambda$ belongs
to some $S_m$ among the family, the index $m = \hat{m}(\lambda)$ being possibly random,
depending on the observation N. The index set $\Lambda$ need not be countable
even though we shall assume so in order to avoid measurability problems.
However, the reader can check that the cardinality of $\Lambda$ plays no role in
our results. Our aim is to select some $\hat{\lambda}$ among $\Lambda$, on the basis of the same
observation N, in such a way that the risk of the estimator $\tilde{s} = \hat{s}_{\hat{\lambda}}$ is as close as
possible to $\inf_{\lambda \in \Lambda} \mathbb{E}[H^2(s, \hat{s}_\lambda)]$. More precisely, the results we get have the
following form

(3) $C \, \mathbb{E}[H^2(s, \tilde{s})] \le \inf_{\lambda \in \Lambda} \left\{ \mathbb{E}[H^2(s, \hat{s}_\lambda)] + \tau \, \mathbb{E}\left[D_{\hat{m}(\lambda)} \vee \Delta_{\hat{m}(\lambda)}\right] \right\} + \tau \Sigma^2$,

where

• the number C is a positive universal constant;

• the number $\tau$ is a scaling parameter depending on the statistical
framework ($\tau = 1/n$ in the density case and $\tau = 1$ in the case of
Example 2);

• the numbers $D_m$ measure the massiveness (in some suitable sense)
of the models $S_m$ (typically, $D_m$ corresponds to the metric dimension of $S_m$,
to be defined later on).

In Inequality (3), the element $\hat{m}(\lambda)$ corresponds to an arbitrary element
(chosen by the statistician) of the random subset $\mathcal{M}(\hat{s}_\lambda)$ defined by

$\mathcal{M}(\hat{s}_\lambda) = \{m \in \mathcal{M}, \ \hat{s}_\lambda \in S_m\}$.

Of course, a minimizer of $D_m \vee \Delta_m$ among those m in $\mathcal{M}(\hat{s}_\lambda)$ provides a
natural choice for $\hat{m}(\lambda)$ since it minimizes the right-hand side of (3). Other
choices are possible. For example, if for some deterministic $m \in \mathcal{M}$, $\hat{s}_\lambda$
belongs to $S_m$ with probability one, it is convenient to take $\hat{m}(\lambda) = m$.
This is in general the case in the context of model selection, for which one
associates to each model $S_m$ a single estimator, denoted $\hat{s}_m$ rather than $\hat{s}_\lambda$,
with values in $S_m$. Then, by taking $\Lambda = \mathcal{M}$, Inequality (3) takes the more
usual form

(4) $C \, \mathbb{E}[H^2(s, \tilde{s})] \le \inf_{m \in \mathcal{M}} \left\{ \mathbb{E}[H^2(s, \hat{s}_m)] + \tau (D_m \vee \Delta_m \vee 1) \right\}$

where C depends on $\Sigma$ only.

In the present paper, our purpose is to go beyond the classical model selection
scheme by allowing the estimators to take their values in a
random model, depending on N, among the collection $\{S_m, m \in \mathcal{M}\}$. Using
the same observation N, our selection procedure is based on a pairwise
comparison of the estimators $\hat{s}_\lambda$. We do so by means of a penalized criterion
based on an estimation of the distance H of each estimator to the true s.
From these pairwise comparisons, we use a selection device inspired by
Birgé (2006) and Baraud and Birgé (2009) to select our estimator $\tilde{s}$ among
the family $\{\hat{s}_\lambda, \lambda \in \Lambda\}$.

Because of these pairwise comparisons, our procedure becomes all the more
difficult to implement as the cardinality of $\Lambda$ grows. For example, if one
tries to estimate a density by a histogram and aims at finding a "good"
partition among a family $\Lambda$ of candidate ones, these comparisons will be
time-consuming, and the procedure practically almost useless, if $|\Lambda|$ is too large.
Nevertheless, one can take advantage of the fact that our procedure allows one
to deal with random partitions $\hat{m}$ in order to reduce the family $\Lambda$ to those
$\hat{m}$ selected from the data by an appropriate algorithm such as CART for example.
From this point of view, our approach can be seen (at least theoretically) as an
alternative to resampling procedures (such as V-fold cross-validation, bootstrap, ...).

The starting point of this paper originates from a series of papers by Lucien
Birgé (Birgé (2006), Birgé (2007) and Birgé (2008)) providing a new perspective
on estimation theory. His approach relies on ideas borrowed from
old papers by Le Cam (1973), Le Cam (1975), Birgé (1983), Birgé (1984b),
Birgé (1984a), showing how to derive good estimators from families of robust
tests between simple hypotheses, and also on more recent ones about complexity
and model selection such as Barron and Cover (1991) and Barron, Birgé and
Massart (1999). The resulting estimator is called a T-estimator (T for test)
and its construction, detailed in Birgé (2006), relies on a good discretization
of the models. A nice feature of T-estimators is that they
require very few assumptions on the collections of models and the parameter
set. Our general approach is inspired by this paper even though the
procedure we propose is different and allows one to consider estimators instead
of only discretization points.

The problem of designing a selection rule solely based on the data in order
to choose a "good" model among a collection of candidate ones is the art
of model selection. This approach has been intensively studied in
recent years. For example, Castellan (2000a), Castellan (2000b), Birgé (2008),
Massart (2007) (Chapter 7) considered the problem of estimating a density,
Reynaud-Bouret (2003) and Birgé (2007) that of estimating the intensity
of a Poisson process, and the regression setting has been studied in
Baraud (2000), Birgé and Massart (2001) and Yang (1999) among other
references. Model selection among histogram-type estimators in the
statistical frameworks described in Examples 1 and 3 (among others) has
been considered in Baraud and Birgé (2009).
A common feature of all these results on model selection is that
they hold for specific estimators built on a given model. In the present paper,
we shall not specify the estimators $\hat{s}_m$, which can therefore be arbitrary.

An alternative to model selection is aggregation (or mixing). The basic idea
is to design a suitable combination of given estimators in order to outperform
each of these separately. This approach can be found in Juditsky and
Nemirovski (2000), Nemirovski (2000), Yang (2000a), (2000b), (2001),
Tsybakov (2003), Wegkamp (2003), Bunea, Tsybakov and Wegkamp (2007) and
Catoni (2004) (we refer to his Saint-Flour lecture notes, which revisit some
mixing techniques he introduced earlier). When the data are not i.i.d., some
nice aggregation results can also be found in Leung and Barron (2006)
for the problem of mixing least-squares estimators of the mean of a Gaussian
vector Y. In their paper, they assume the components of Y to be independent
with a known common variance. Giraud (2009) extended their results
to the case where the variance is unknown.

The paper is organized as follows. The basic ideas underlying our approach
are described in Section 2 and the main results are presented in Section 3.
In Sections 4 and 5, we show how our procedure provides an alternative to
the T-estimators and histogram-type estimators studied respectively in
Birgé (2006) and Baraud and Birgé (2009), under the same assumptions.
Moreover, we shall also consider in Section 4 the case of histogram-type
estimators based on random partitions (obtained by an algorithm such as
CART for example). In Section 6, we consider the problem of estimating
the mean s of a random vector with nonnegative and independent components
(typically the distributions we have in mind are Binomial, Poisson or
Gamma). We consider two cases. One corresponds to the situation where
$\sqrt{s} = (\sqrt{s_1}, \ldots, \sqrt{s_n})$ is of the form $(F(x_1), \ldots, F(x_n))$ for some nonnegative
function F and points $x_1, \ldots, x_n$ in [0, 1]. For this problem, we show that
the resulting estimator achieves the usual rate of convergence over classes
of Besov balls. Alternatively, we consider the situation where $\sqrt{s}$ is a linear
combination of predictors $v^1, \ldots, v^p$, the number p being allowed to be larger
than n. The problem we consider is that of variable selection and we aim
at selecting a "best" subset of predictors in order to minimize the estimation
risk. Section 7 is devoted to the regression framework as described in
Example 2. We consider there the problem of complete variable selection
when the errors are neither Gaussian nor sub-Gaussian, a setting which, to our
knowledge, is new. In contrast, the Gaussian case has been intensively studied in
recent years. It has been the usual statistical setting for justifying the
use of numerous procedures, among which are those of Birgé and Massart (2001),
Tibshirani (1996) with the Lasso, Efron et al. (2004) for LARS, Candès and
Tao (2007) for the Dantzig selector and Baraud, Giraud and Huet (2009)
when the variance of the errors is unknown. As we shall see, our selection
procedure requires very mild assumptions on the distribution of the errors
(provided that it is known). In particular, we need not assume that the
errors admit any finite moment. Finally, Section 8 is devoted to the proofs.

Throughout, we shall use the following notations. The quantity $|E|$ denotes
the cardinality of a finite set E. The Euclidean norm of $\mathbb{R}^n$ is denoted $\|\cdot\|$. We
set $\mathbb{R}_+^* = \mathbb{R}_+ \setminus \{0\}$ and for $t \in \mathbb{R}_+^n$ denote by $\sqrt{t}$ the vector $(\sqrt{t_1}, \ldots, \sqrt{t_n})$.
Given a closed convex subset A of $\mathbb{R}^n$, $\Pi_A$ is the projection operator onto
A. We set for $t \in \mathcal{C}_0$ and $\mathcal{F} \subset \mathcal{C}_0$, $H(t, \mathcal{F}) = \inf_{f \in \mathcal{F}} H(t, f)$ and for $y > 0$,

$B(t, y) = \{t' \in \mathcal{C}_0, \ H(t, t') \le y\}$.

Throughout, z denotes some number in the interval $(0, 1 - 1/\sqrt{2})$ to be chosen
arbitrarily by the statistician and $C, C', C'', \ldots$ constants that may vary from
line to line.



2. Basic formulas and basic ideas 

The aim of this section is to present the basic formulas and ideas underlying
our approach. For the sake of simplicity, we shall assume k = 1 until further
notice. For $t \in \mathcal{C}_0$, we define

$\rho(s, t) = \int_{\mathcal{X}} \sqrt{st} \, d\mu$.

This quantity corresponds to the Hellinger affinity whenever s and t are
densities. Note that $H^2(s, t)$ is related to $\rho(s, t)$ by the formula

$2H^2(s, t) = \int_{\mathcal{X}} s \, d\mu + \int_{\mathcal{X}} t \, d\mu - 2\rho(s, t)$.

Throughout, t, t' will denote two elements of $\mathcal{C}_0$ that one should think of as
estimators of s. One would prefer t' to t if $H^2(s, t')$ is smaller than $H^2(s, t)$
or equivalently if

$\left[\rho(s, t') - \frac{1}{2} \int_{\mathcal{X}} t' \, d\mu\right] - \left[\rho(s, t) - \frac{1}{2} \int_{\mathcal{X}} t \, d\mu\right] > 0$.

Since $\int_{\mathcal{X}} t \, d\mu$ and $\int_{\mathcal{X}} t' \, d\mu$ are both known, deciding whether t' is preferable
to t amounts to estimating $\rho(s, t)$ and $\rho(s, t')$ in a suitable way. In the
following sections, we present the material that will enable us to estimate
these quantities on the basis of the observation N.



2.1. An approximation of $\rho(., .)$. We start with the following variational
formula.

Proposition 1. Let S be a subset of $\mathcal{C}_0$ containing s. For all $t \in \mathcal{C}_0$, we
have

$\rho(s, t) = \inf_{r \in S} \rho_r(s \, d\mu, t)$

where, for a measure $\nu$ on $(\mathcal{X}, \mathcal{A})$,

(5) $\rho_r(\nu, t) = \frac{1}{2} \left[ \rho(t, r) + \int_{\mathcal{X}} \sqrt{\frac{t}{r}} \, d\nu \right]$

(using the conventions 0/0 = 0 and a/0 = +∞ for all a > 0). Besides, the
infimum is achieved for r = s.

Proof. With the above conventions, note that for all nonnegative numbers
x, y, $2\sqrt{x} \le \sqrt{y} + x/\sqrt{y}$. By applying this inequality with x = st, y = rt,
the result follows by integration with respect to $\mu$. Besides, equality holds
for r = s. □



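It follows from the above proposition that, for a given $r \in \mathcal{C}_0$, $\rho_r(s \, d\mu, t)$
approximates $\rho(s, t)$ from above. In fact, we can make this statement
slightly more precise.

Proposition 2. Let $s, t, r \in \mathcal{C}_0$. We have

$\rho_r(s \, d\mu, t) - \rho(s, t) = \frac{1}{2} \int_{\mathcal{X}} \sqrt{\frac{t}{r}} \left(\sqrt{s} - \sqrt{r}\right)^2 d\mu$.

If $r = (t + t')/2$ with $t' \in \mathcal{C}_0$, then

(6) $0 \le \rho_r(s \, d\mu, t) - \rho(s, t) \le \frac{1}{\sqrt{2}} \left[H^2(s, t) + H^2(s, t')\right]$.

Proof. It follows from the definition of $\rho_r$ that

$2\left[\rho_r(s \, d\mu, t) - \rho(s, t)\right] = \int_{\mathcal{X}} \sqrt{tr} \, d\mu + \int_{\mathcal{X}} \sqrt{\frac{t}{r}} \, s \, d\mu - 2 \int_{\mathcal{X}} \sqrt{st} \, d\mu = \int_{\mathcal{X}} \sqrt{\frac{t}{r}} \left(\sqrt{s} - \sqrt{r}\right)^2 d\mu$.

For the second part, note that $(t/r)(x) \le 2$ for all $x \in \mathcal{X}$ and therefore
$\rho_r(s \, d\mu, t) - \rho(s, t) \le \sqrt{2} H^2(s, r)$. It remains to bound $H^2(s, r)$ from above.
The concavity of the map $u \mapsto \sqrt{u}$ implies that $\rho(s, r) \ge [\rho(s, t) + \rho(s, t')]/2$
and therefore $2H^2(s, r) \le H^2(s, t) + H^2(s, t')$, which leads to the result. □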
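
Propositions 1 and 2 can be checked numerically on a toy example. The sketch below assumes k = 1, a finite space with the counting measure as reference measure, and strictly positive functions (so that the 0/0 convention is never needed); the helper names are illustrative and are not taken from the paper.

```python
import math
import random


def rho(s, t):
    """Affinity rho(s, t) = sum_x sqrt(s(x) * t(x)) (counting measure)."""
    return sum(math.sqrt(a * b) for a, b in zip(s, t))


def hellinger_sq(s, t):
    """H^2(s, t) = (1/2) * sum_x (sqrt(s(x)) - sqrt(t(x)))^2."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(s, t))


def rho_r(nu, t, r):
    """rho_r(nu, t) = (1/2) * [rho(t, r) + integral of sqrt(t / r) d nu].

    `nu` holds the weights nu({x}) of a measure on the finite space.
    All values of r are assumed strictly positive.
    """
    return 0.5 * (rho(t, r) + sum(math.sqrt(a / b) * w for a, b, w in zip(t, r, nu)))


random.seed(0)
size = 6
s = [random.uniform(0.1, 2.0) for _ in range(size)]
t = [random.uniform(0.1, 2.0) for _ in range(size)]
t_prime = [random.uniform(0.1, 2.0) for _ in range(size)]
r = [(a + b) / 2 for a, b in zip(t, t_prime)]

# Proposition 1: rho_r(s dmu, t) >= rho(s, t), with equality at r = s.
# With the counting measure, the measure s dmu has weights s(x).
gap_at_s = rho_r(s, t, s) - rho(s, t)
gap_at_r = rho_r(s, t, r) - rho(s, t)

# Inequality (6): for r = (t + t') / 2 the gap is at most
# (1 / sqrt(2)) * [H^2(s, t) + H^2(s, t')].
upper = (hellinger_sq(s, t) + hellinger_sq(s, t_prime)) / math.sqrt(2)
print(gap_at_s, gap_at_r, upper)
```

The first printed value should vanish up to rounding, and the second should lie between 0 and the third.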

The important point about Proposition 2 (more precisely Inequality (6)) lies
in the fact that the constant $1/\sqrt{2}$ is smaller than 1. This makes it possible
to use the (sign of the) difference

$T(s \, d\mu, t, t') = \left[\rho_r(s \, d\mu, t') - \frac{1}{2} \int_{\mathcal{X}} t' \, d\mu\right] - \left[\rho_r(s \, d\mu, t) - \frac{1}{2} \int_{\mathcal{X}} t \, d\mu\right]$

with $r = (t + t')/2$ as an alternative benchmark to find the closest element
to s (up to a multiplicative constant) among the pair (t, t'). More precisely,
we can deduce from Proposition 2 the following corollary.

Corollary 1. If $T(s \, d\mu, t, t') \ge 0$, then

$H^2(s, t') \le \frac{\sqrt{2} + 1}{\sqrt{2} - 1} H^2(s, t)$.
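
Proof. Using Inequality (6) and the assumption, we have

$H^2(s, t') - H^2(s, t) = \left[\rho(s, t) - \frac{1}{2} \int_{\mathcal{X}} t \, d\mu\right] - \left[\rho(s, t') - \frac{1}{2} \int_{\mathcal{X}} t' \, d\mu\right]$

$= \left[\rho_r(s \, d\mu, t) - \frac{1}{2} \int_{\mathcal{X}} t \, d\mu\right] - \left[\rho_r(s \, d\mu, t') - \frac{1}{2} \int_{\mathcal{X}} t' \, d\mu\right] + \rho(s, t) - \rho_r(s \, d\mu, t) + \rho_r(s \, d\mu, t') - \rho(s, t')$

$\le \frac{1}{\sqrt{2}} \left[H^2(s, t) + H^2(s, t')\right]$

which leads to the result. □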



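2.2. An estimator of $\rho_r(., .)$. Throughout, given $t, t' \in \mathcal{C}_0$, we set

$r = \frac{t + t'}{2} \in \mathcal{C}_0$.

The superiority of the quantity $\rho_r(s \, d\mu, t)$ over $\rho(s, t)$ lies in the fact that
the former can easily be estimated by its empirical counterpart, namely

(7) $\rho_r(N, t) = \frac{1}{2} \left[ \rho(t, r) + \int_{\mathcal{X}} \sqrt{\frac{t}{r}} \, dN \right]$.

Note that $\rho_r(N, t)$ is an unbiased estimator of $\rho_r(s \, d\mu, t)$ because of (2).
Consequently, a natural way of deciding which of t and t' is the closest
to s is to consider the test statistic

$T(N, t, t') = \left[\rho_r(N, t') - \frac{1}{2} \int_{\mathcal{X}} t' \, d\mu\right] - \left[\rho_r(N, t) - \frac{1}{2} \int_{\mathcal{X}} t \, d\mu\right]$.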

Replacing the "ideal" test statistic $T(s \, d\mu, t, t')$ by its empirical counterpart
leads to an estimation error given by the process $Z(N, ., .)$ defined on $\mathcal{C}_0^2$ by

$Z(N, t, t') = T(N, t, t') - T(s \, d\mu, t, t')$

$= \left[\rho_r(N, t') - \rho_r(s \, d\mu, t')\right] - \left[\rho_r(N, t) - \rho_r(s \, d\mu, t)\right]$

$= \int_{\mathcal{X}} \psi(t, t', x) \, dN - \int_{\mathcal{X}} \psi(t, t', x) \, s \, d\mu$

where $\psi(t, t', x)$ is the function on $\mathcal{C}_0^2 \times \mathcal{X}$ with values in $[-1/\sqrt{2}, 1/\sqrt{2}]$
given by

(8) $\psi(t, t', x) = \frac{1}{\sqrt{2}} \left[ \frac{1}{\sqrt{1 + t(x)/t'(x)}} - \frac{1}{\sqrt{1 + t'(x)/t(x)}} \right]$.

The study of the empirical process $Z(N, ., .)$ over the product space $S \times S'$
is at the heart of our techniques.
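
The identity between the estimation error Z and the integrals of $\psi$ can be verified on a small example. This is a sketch under simplifying assumptions: k = 1, a three-point space with the counting measure, strictly positive candidates t and t', and made-up empirical frequencies standing in for N. The function names are hypothetical, and `psi` uses the algebraically equivalent form $(\sqrt{t'} - \sqrt{t})/\sqrt{2(t + t')}$ of (8).

```python
import math


def psi(t_x, tp_x):
    """psi(t, t', x) from (8), written as (sqrt(t') - sqrt(t)) / sqrt(2 (t + t')).

    The convention 0/0 = 0 gives the value 0 when both t(x) and t'(x) vanish.
    """
    total = t_x + tp_x
    if total == 0:
        return 0.0
    return (math.sqrt(tp_x) - math.sqrt(t_x)) / math.sqrt(2.0 * total)


def rho_r(nu, t, r):
    """(1/2) [rho(t, r) + integral of sqrt(t / r) d nu], with r > 0 everywhere."""
    return 0.5 * sum(math.sqrt(a * b) + math.sqrt(a / b) * w
                     for a, b, w in zip(t, r, nu))


def t_statistic(nu, t, tp):
    """T(nu, t, t') with r = (t + t') / 2 and the counting measure as mu."""
    r = [(a + b) / 2 for a, b in zip(t, tp)]
    return (rho_r(nu, tp, r) - 0.5 * sum(tp)) - (rho_r(nu, t, r) - 0.5 * sum(t))


s = [0.2, 0.3, 0.5]        # true density, i.e. the weights of s dmu
N = [0.3, 0.2, 0.5]        # hypothetical empirical frequencies of a 10-sample
t = [0.25, 0.25, 0.5]
tp = [0.6, 0.2, 0.2]

# Z(N, t, t') computed from its definition and from the psi representation.
z_direct = t_statistic(N, t, tp) - t_statistic(s, t, tp)
z_psi = sum(psi(a, b) * (n_x - s_x) for a, b, n_x, s_x in zip(t, tp, N, s))
print(z_direct, z_psi)
```

Both printed numbers should coincide up to rounding; the statistic is also antisymmetric in (t, t') because r is symmetric in the pair.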

2.3. The multidimensional case k > 1. In the multidimensional case, the
same results can be obtained by reasoning component by component. More
precisely, the formulas of the above sections extend by using the convention
that for all k-tuples $\nu = (\nu_1, \ldots, \nu_k)$ of measures on $(\mathcal{X}_1, \mathcal{A}_1), \ldots, (\mathcal{X}_k, \mathcal{A}_k)$
respectively,

$\int_{\mathcal{X}} \varphi(s, t, t', r) \, d\nu = \sum_{i=1}^{k} \int_{\mathcal{X}_i} \varphi(s_i, t_i, t'_i, r_i) \, d\nu_i$,

whatever the functions $s, t, t', r \in \mathcal{C}_0$ and mappings $\varphi$ from $\mathbb{R}_+^4$ into $\mathbb{R}$.



3. The main results 



Throughout this section, we consider an at most countable index set $\mathcal{M}$
and a family $\{S_m, m \in \mathcal{M}\}$ of nonvoid subsets of $\mathcal{C}_0$ that we shall refer to as
models. Besides, we assume we have at our disposal an at most countable family
$\{\hat{s}_\lambda, \lambda \in \Lambda\}$ of estimators of s based on N with values in $\mathbb{S} = \bigcup_{m \in \mathcal{M}} S_m$.
In particular, to each $\lambda \in \Lambda$ corresponds an estimator $\hat{s}_\lambda$ together with a
(possibly random) index $\hat{m}(\lambda) \in \mathcal{M}$ such that $\hat{s}_\lambda \in S_{\hat{m}(\lambda)}$. Setting for $t \in \mathbb{S}$,

$\mathcal{M}(t) = \{m \in \mathcal{M}, \ t \in S_m\}$

we therefore have $\hat{m}(\lambda) \in \mathcal{M}(\hat{s}_\lambda)$. We associate a nonnegative weight $\Delta_m$
to each $m \in \mathcal{M}$ and assume that

(9) $\Sigma = \sum_{m \in \mathcal{M}} e^{-\Delta_m} < +\infty$ and $\Delta_m \ge 1$ for all $m \in \mathcal{M}$.

The condition $\Delta_m \ge 1$ for all $m \in \mathcal{M}$ is only required to simplify the
presentation of our results.

As already mentioned in the introduction, our aim is to select some estimator
among the family $\{\hat{s}_\lambda, \lambda \in \Lambda\}$ in order to achieve the smallest possible risk.
We shall distinguish between two situations.



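3.1. Direct selection. Let $\tau, \gamma$ be positive numbers. We consider the following
selection procedure.

Procedure 1. Let pen be some penalty function mapping $\mathbb{S}$ into $\mathbb{R}_+$. Given
a pair $(\hat{s}_\lambda, \hat{s}_{\lambda'})$ such that $\hat{s}_\lambda \ne \hat{s}_{\lambda'}$, we consider the test statistic

(10) $T(N, \hat{s}_\lambda, \hat{s}_{\lambda'}) = \left[\rho_r(N, \hat{s}_{\lambda'}) - \frac{1}{2} \int_{\mathcal{X}} \hat{s}_{\lambda'} \, d\mu - \mathrm{pen}(\hat{s}_{\lambda'})\right] - \left[\rho_r(N, \hat{s}_\lambda) - \frac{1}{2} \int_{\mathcal{X}} \hat{s}_\lambda \, d\mu - \mathrm{pen}(\hat{s}_\lambda)\right]$

where $r = (\hat{s}_\lambda + \hat{s}_{\lambda'})/2$ and $\rho_r(N, .)$ is given by (7). We set

$\mathcal{E}(\hat{s}_\lambda) = \{\hat{s}_{\lambda'}, \ T(N, \hat{s}_\lambda, \hat{s}_{\lambda'}) \ge 0\}$

and note that either $\hat{s}_\lambda \in \mathcal{E}(\hat{s}_{\lambda'})$ or $\hat{s}_{\lambda'} \in \mathcal{E}(\hat{s}_\lambda)$ since $T(N, \hat{s}_\lambda, \hat{s}_{\lambda'}) = -T(N, \hat{s}_{\lambda'}, \hat{s}_\lambda)$. Then, we define

$\mathcal{D}(\hat{s}_\lambda) = \sup \{H^2(\hat{s}_\lambda, \hat{s}_{\lambda'}) \mid \hat{s}_{\lambda'} \in \mathcal{E}(\hat{s}_\lambda)\}$ if $\mathcal{E}(\hat{s}_\lambda) \ne \emptyset$

and $\mathcal{D}(\hat{s}_\lambda) = 0$ otherwise. Finally, we select $\hat{\lambda}$ among $\Lambda$ as any element
satisfying

$\mathcal{D}(\hat{s}_{\hat{\lambda}}) \le \mathcal{D}(\hat{s}_\lambda) + \tau$, $\forall \lambda \in \Lambda$.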
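
Procedure 1 can be sketched in a few lines for a finite family of candidates. The sketch below is an illustration under simplifying assumptions, not the paper's implementation: k = 1, a finite space with the counting measure, candidates stored as lists, and the observed measure given by its point weights. The names `pair_statistic` and `select` are hypothetical. The sketch takes an exact minimizer of the pairwise criterion, which in particular satisfies the selection rule for every tolerance $\tau > 0$.

```python
import math


def hellinger_sq(s, t):
    """H^2(s, t) = (1/2) * sum_x (sqrt(s(x)) - sqrt(t(x)))^2."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(s, t))


def rho_r(nu, t, r):
    """(1/2) [rho(t, r) + integral of sqrt(t / r) d nu], with 0/0 = 0."""
    total = 0.0
    for a, b, w in zip(t, r, nu):
        total += math.sqrt(a * b)
        if b > 0:  # where r(x) = 0, t(x) = 0 as well and the ratio is 0
            total += math.sqrt(a / b) * w
    return 0.5 * total


def pair_statistic(nu, t, t_prime, pen_t, pen_t_prime):
    """Penalized statistic: nonnegative values favor t' over t."""
    r = [(a + b) / 2 for a, b in zip(t, t_prime)]
    score_prime = rho_r(nu, t_prime, r) - 0.5 * sum(t_prime) - pen_t_prime
    score = rho_r(nu, t, r) - 0.5 * sum(t) - pen_t
    return score_prime - score


def select(nu, candidates, penalties):
    """Return (index minimizing the criterion, list of criterion values)."""
    d_values = []
    for i, t in enumerate(candidates):
        d = 0.0  # value of the criterion when no competitor is preferred to t
        for j, t_prime in enumerate(candidates):
            if t_prime == t:
                continue  # only pairs of distinct estimators are compared
            if pair_statistic(nu, t, t_prime, penalties[i], penalties[j]) >= 0:
                d = max(d, hellinger_sq(t, t_prime))
        d_values.append(d)
    best = min(range(len(candidates)), key=d_values.__getitem__)
    return best, d_values


s = [0.2, 0.3, 0.5]
candidates = [[1 / 3, 1 / 3, 1 / 3], [0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
# Noise-free check: the observed measure is replaced by its expectation s dmu.
best, d_values = select(s, candidates, [0.0, 0.0, 0.0])
print(best, d_values)
```

In this noise-free check with zero penalties, the candidate equal to s is never beaten in a pairwise comparison, so its criterion is 0 and it is selected. The number of pairwise comparisons grows quadratically with the number of candidates.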

For $(t, t') \in \mathcal{C}_0^2$ and $y > 0$, let us set

$w^2(t, t', y) = \left[H^2(s, t) + H^2(s, t')\right] \vee y^2$.

We assume the following.

Assumption 1 ($\tau, \gamma$). For all pairs $(m, m') \in \mathcal{M}^2$, there exist positive
numbers $d_m, d_{m'}$ such that for all $\xi > 0$ and $y^2 \ge \tau (d_m \vee d_{m'} + \xi)$,

$\mathbb{P}\left[ \sup_{(t, t') \in S_m \times S_{m'}} \frac{Z(N, t, t')}{w^2(t, t', y)} \ge z \right] \le \gamma e^{-\xi}$.

This assumption means that for $\xi$ large enough the error process $Z(N, t, t')$
is uniformly controlled by $w^2(t, t', y)$ over $S_m \times S_{m'}$ with probability close to
1. Under suitable assumptions, the quantities $d_m$ measure in some sense the
massiveness of the $S_m$. For example, if $S_m$ is the linear span of the functions
that are piecewise constant on each element of a partition m of $\mathcal{X}$, then $d_m$ is merely
proportional to the cardinality of m. If $S_m$ is a discrete subset of $\mathcal{C}_0$, $d_m$ is
related to its metric dimension (in a sense to be specified later on).

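We obtain the following result.

Theorem 1. Let $\tau, \gamma$ be numbers and $\{\Delta_m, m \in \mathcal{M}\}$ a family of nonnegative
numbers satisfying (9). Under Assumption 1, choose $\tilde{s} = \hat{s}_{\hat{\lambda}}$ among the
family $\{\hat{s}_\lambda, \lambda \in \Lambda\}$ according to Procedure 1 with pen satisfying

(11) $\mathrm{pen}(t) \ge z \tau \inf \{d_m + \Delta_m, \ m \in \mathcal{M}(t)\}$ $\forall t \in \mathbb{S}$.

Then, for all $\xi > 0$,

$\mathbb{P}\left[ H^2(s, \tilde{s}) \ge C_1 \inf_{\lambda \in \Lambda} \left[H^2(s, \hat{s}_\lambda) + \mathrm{pen}(\hat{s}_\lambda)\right] + C_2 \tau \xi \right] \le \left(\gamma \Sigma^2 e^{-\xi}\right) \wedge 1$,

where $C_1 = C_1(z)$ and $C_2 = C_2(z)$ are positive numbers given by (36)
and (37) respectively, depending on the choice of z only.

The proof is delayed to Section 8.1.

By integration with respect to $\xi$ we deduce the following risk bound.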
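
Corollary 2. Under the assumptions of Theorem 1, there exists a constant
C depending on z only such that

$C \, \mathbb{E}[H^2(s, \tilde{s})] \le \mathbb{E}\left[ \inf_{\lambda \in \Lambda} \{H^2(s, \hat{s}_\lambda) + \mathrm{pen}(\hat{s}_\lambda)\} \right] + \tau \left[(\gamma \Sigma^2) \vee 1\right]$

$\le \inf_{\lambda \in \Lambda} \left\{ \mathbb{E}\left[H^2(s, \hat{s}_\lambda) + \mathrm{pen}(\hat{s}_\lambda)\right] \right\} + \tau \left[(\gamma \Sigma^2) \vee 1\right]$.

In particular, if equality holds in (11),

(12) $\mathbb{E}[H^2(s, \tilde{s})] \le C' \inf_{\lambda \in \Lambda} \left\{ \mathbb{E}[H^2(s, \hat{s}_\lambda)] + \mathbb{E}[v^2(\hat{s}_\lambda)] \right\}$

where, for all $\lambda \in \Lambda$,

(13) $v^2(\hat{s}_\lambda) = \tau \inf_{m \in \mathcal{M}(\hat{s}_\lambda)} (d_m \vee \Delta_m) \le \tau \left(d_{\hat{m}(\lambda)} \vee \Delta_{\hat{m}(\lambda)}\right)$

and C' is a constant depending on z, $\gamma$ and $\Sigma$.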

Inequality (12) compares the risk of the resulting estimator $\tilde{s}$ to those of the
$\hat{s}_\lambda$ plus an additional term $\mathbb{E}[v^2(\hat{s}_\lambda)]$. If $\hat{s}_\lambda$ belongs to $S_m$ with probability
1,

(14) $v^2(\hat{s}_\lambda) \le \tau (d_m \vee \Delta_m)$.

We emphasize that (14) does not take into account the complexity of the
collection of estimators $\{\hat{s}_\lambda, \lambda \in \Lambda\}$ itself. In particular, if for all $\lambda \in \Lambda$, $\hat{s}_\lambda$
belongs to the same model $S_m$ with probability 1, then by taking $\mathcal{M} = \{m\}$
and $\Delta_m = 1$, we obtain for $\tilde{s}$ the following risk bound

$\mathbb{E}[H^2(s, \tilde{s})] \le C' \left[ \inf_{\lambda \in \Lambda} \mathbb{E}[H^2(s, \hat{s}_\lambda)] + \tau (d_m \vee 1) \right]$

no matter how large the collection of $\hat{s}_\lambda$ is.



3.2. Indirect selection. Let $\tau$, M be some positive numbers. Throughout
this section, we assume that for some nonnegative numbers a, b, c, the
measure N satisfies the following.

Assumption 2 (a, b, c). For all $y, \xi > 0$,

$\sup_{t, t' \in B(s, y)} \mathbb{P}\left[Z(N, t, t') \ge \xi\right] \le b \exp\left[ -\frac{a \xi^2}{y^2 + c \xi} \right]$.

This assumption is satisfied in the following cases.

Proposition 3. Assumption 2 holds with a = n 2 /6, 6=1 and c = n\/2/6 
for Example 1, with a = 1/6, b = 1 and c = y/2/6 for Example 2 and with 
a = 1/12, b = l andc = v^2/36 for Example 3. 

The proof of the proposition is delayed to Section 8.4.

In order to select among the family of estimators $\{\hat{s}_\lambda, \lambda \in \Lambda\}$, we introduce
an auxiliary family $\{\mathcal{S}_m, m \in \mathcal{M}\}$ of discrete subsets of $\mathcal{C}_0$ satisfying the
following assumption.

Assumption 3 ($\tau$, M). For all $m \in \mathcal{M}$ and $s \in \mathcal{C}_0$, there exists $\eta_m \ge 1/2$
such that

$\left|\mathcal{S}_m \cap B(s, r\sqrt{\tau})\right| \le M \exp\left[\frac{r^2}{2}\right]$, $\forall r \ge 2\eta_m$.

As we shall see, the parameter $\eta_m$ is convenient to measure the massiveness
of the discrete set $\mathcal{S}_m$. It is related to a metric dimension (in a sense to be
specified later on).

Assumptions 2 and 3 are related to our former Assumption 1 by the following
result.

Lemma 1. If Assumptions 2 and 3 hold with $\tau = 4(2 + cz)/(az^2)$, then the
collection of models $\{\mathcal{S}_m, m \in \mathcal{M}\}$ satisfies Assumption 1 with $\gamma = bM^2$ and
$d_m = 4\eta_m^2$ for all $m \in \mathcal{M}$.

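Consider now the following selection procedure.

Procedure 2. Let pen be some penalty function from $\mathcal{S} = \bigcup_{m \in \mathcal{M}} \mathcal{S}_m$ into
$\mathbb{R}_+$. To each $\lambda \in \Lambda$, associate the auxiliary estimator $\tilde{s}_\lambda$ as any element of
$\mathcal{S}$ satisfying

$H^2(\hat{s}_\lambda, \tilde{s}_\lambda) + \mathrm{pen}(\tilde{s}_\lambda) \le A(\hat{s}_\lambda, \mathcal{S}) + \tau$

where

$A(\hat{s}_\lambda, \mathcal{S}) = \inf_{t \in \mathcal{S}} \left[H^2(\hat{s}_\lambda, t) + \mathrm{pen}(t)\right]$.

Select $\tilde{\lambda}$ among $\Lambda$ by using Procedure 1 with the family of estimators $\{\tilde{s}_\lambda, \lambda \in \Lambda\}$.
Finally, select $\hat{\lambda}$ as any element of $\Lambda$ such that

$H^2(\hat{s}_{\hat{\lambda}}, \tilde{s}_{\tilde{\lambda}}) \le \inf_{\lambda \in \Lambda} H^2(\hat{s}_\lambda, \tilde{s}_{\tilde{\lambda}}) + \tau$.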

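The following holds.

Theorem 2. Let M be a positive number and $\{\Delta_m, m \in \mathcal{M}\}$ a family of
numbers satisfying (9). Assume that Assumptions 2 and 3 hold with
$\tau = 4(2 + cz)/(az^2)$. Let $\tilde{s} = \hat{s}_{\hat{\lambda}}$ be the estimator obtained by selecting $\hat{\lambda}$ according
to Procedure 2 with

(15) $\mathrm{pen}(t) \ge z \tau \inf_{m \in \mathcal{M}(t)} \left(4\eta_m^2 + \Delta_m\right)$ $\forall t \in \mathcal{S}$.

Then, for all $\xi > 0$,

$\mathbb{P}\left[ H^2(s, \tilde{s}) \ge C \left( \inf_{\lambda \in \Lambda} \left(H^2(s, \hat{s}_\lambda) + A(\hat{s}_\lambda, \mathcal{S})\right) + \tau \xi \right) \right] \le \left(b M^2 \Sigma^2 e^{-\xi}\right) \wedge 1$,

and

$C' \, \mathbb{E}[H^2(s, \tilde{s})] \le \mathbb{E}\left[ \inf_{\lambda \in \Lambda} \{H^2(s, \hat{s}_\lambda) + A(\hat{s}_\lambda, \mathcal{S})\} \right] + \tau \left[(b M^2 \Sigma^2) \vee 1\right]$

$\le \inf_{\lambda \in \Lambda} \left\{ \mathbb{E}\left[H^2(s, \hat{s}_\lambda) + A(\hat{s}_\lambda, \mathcal{S})\right] \right\} + \tau \left[(b M^2 \Sigma^2) \vee 1\right]$,

where C, C' are positive numbers depending on z only.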

The risk bound we get involves the quantity $A(\hat{s}_\lambda, \mathcal{S})$, which depends on the
approximation property of $\mathcal{S}$ with respect to the (random) family $\{\hat{s}_\lambda, \lambda \in \Lambda\} \subset \mathbb{S}$.
In the favorable situation where the $\hat{s}_\lambda$ take their values in $\mathcal{S}$ and if equality
holds in (15), then

$A(\hat{s}_\lambda, \mathcal{S}) \le \mathrm{pen}(\hat{s}_\lambda) \le z \tau \left(4\eta_{\hat{m}(\lambda)}^2 + \Delta_{\hat{m}(\lambda)}\right)$.

In a more general case, one needs to choose $\mathcal{S}$ to possess good approximation
properties with respect to the elements of $\mathbb{S}$ in order to keep the quantity
$A(\hat{s}_\lambda, \mathcal{S})$ as small as possible for all $\lambda \in \Lambda$. To ensure such a property, it is
convenient to choose $\mathcal{S}_m$ as a suitable discretization of $S_m$ for all $m \in \mathcal{M}$.

Definition 1. Let S be a subset of $(\mathcal{C}_0, H)$ and $\varepsilon$ some positive number.
We shall say that $\mathcal{S}$ is an $\varepsilon$-net for S if $\mathcal{S} \subset S$ and if for all $t \in S$, there
exists $t' \in \mathcal{S}$ such that $H(t, t') \le \varepsilon$. For nonnegative numbers M, D, we shall
specify that $\mathcal{S}$ is an $(M, \varepsilon, D)$-net for S if for all $s \in \mathcal{C}_0$ and $r \ge 2\varepsilon$,

(16) $\left|\{t \in \mathcal{S}, \ H(s, t) \le r\}\right| \le M \exp\left[ D \left(\frac{r}{\varepsilon}\right)^2 \right]$.

The parameter D corresponds to an upper bound on what is usually called
the metric dimension of S (we refer to Birgé (2006), Definition 6). Under
suitable assumptions and provided that the $\varepsilon$-net has been suitably chosen,
the metric dimension D of S provides an upper bound (up to a suitable
renormalization) for the minimax estimation rate over S. In many cases
of interest, it turns out that D actually provides the right order of magnitude
but, unfortunately, not always. For a complete discussion with examples
and counter-examples on the connection between metric dimensions and
minimax estimation rates, we refer the reader to Birgé (1983) and Yang and
Barron (1999).

We deduce from Theorem 2 the following corollary.

Corollary 3. Let M be a positive number and $\{\Delta_m, m \in \mathcal{M}\}$ a family of
nonnegative numbers satisfying (9). Assume that Assumption 2 holds and
that for all $m \in \mathcal{M}$, $\mathcal{S}_m$ is an $(M, \eta_m \sqrt{\tau}, D_m)$-net for $S_m$ with $\tau = 4(2 + cz)/(az^2)$
and $\eta_m^2 = 2(D_m \vee 1/8)$. If equality holds in (15), the estimator $\tilde{s}$ defined in
Theorem 2 satisfies

(17) $\mathbb{E}[H^2(s, \tilde{s})] \le C \inf_{\lambda \in \Lambda} \left\{ \mathbb{E}[H^2(s, \hat{s}_\lambda)] + \tau \, \mathbb{E}\left[D_{\hat{m}(\lambda)} \vee \Delta_{\hat{m}(\lambda)}\right] \right\}$

where C is a constant depending on z, M and $\Sigma$ only.

Since the statistician is free to choose $\hat{m}(\lambda)$ as any element of $\mathcal{M}(\hat{s}_\lambda)$, a
natural choice in view of minimizing (17) is to take it as (any) minimizer of
$D_m \vee \Delta_m$ among those $m \in \mathcal{M}(\hat{s}_\lambda)$.

If one considers a family of estimators $\{\hat{s}_m, m \in \mathcal{M}\}$ (here $\Lambda = \mathcal{M}$) such
that $\hat{s}_m$ belongs to $S_m$ with probability one, we deduce from Corollary 3
that the estimator $\tilde{s} = \hat{s}_{\hat{m}}$ satisfies

(18) $\mathbb{E}[H^2(s, \tilde{s})] \le C \inf_{m \in \mathcal{M}} \left\{ \mathbb{E}[H^2(s, \hat{s}_m)] + \tau (D_m \vee \Delta_m) \right\}$.

Moreover, if for some universal constant c > 0 the estimators $\hat{s}_m$ satisfy

$\mathbb{E}[H^2(s, \hat{s}_m)] \ge c \tau D_m$, $\forall s \in \mathcal{C}_0$, $\forall m \in \mathcal{M}$,

then (18) shows that $\tilde{s}$ satisfies the oracle-type inequality

$\mathbb{E}[H^2(s, \tilde{s})] \le C \inf_{m \in \mathcal{M}} \left\{ \mathbb{E}[H^2(s, \hat{s}_m)] \vee (\tau \Delta_m) \right\}$.

4. Selecting among histogram-type estimators 

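In this section we assume that $\mathcal{M}$ is a family of partitions of $\mathcal{X}$ and, for $m \in \mathcal{M}$,
that $S_m$ is the set gathering the elements of $\mathcal{C}_0$ which are piecewise constant
on each element of the partition m, that is

$S_m = \left\{ \sum_{I \in m} a_I 1_I, \ (a_I)_{I \in m} \in \mathbb{R}_+^{|m|} \right\} \cap \mathcal{C}_0$.

We shall therefore consider a family $\{S_m, m \in \mathcal{M}\}$ of such models and
$\{\hat{s}_\lambda, \lambda \in \Lambda\}$ a family of estimators of the form $\sum_{I \in \hat{m}} \hat{a}_I 1_I$, the values $\hat{a}_I$
and the partition $\hat{m} \in \mathcal{M}$ being allowed to be random, depending on the
observation N.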

Throughout this section, we assume that k = 1. The applications we have
in mind include Examples 1 and 3 and also the following statistical setting.

Example 4. We observe a vector $X = (X_1, \ldots, X_n)$ the components of
which are independent and nonnegative with respective means $s_i$. Our aim
is to estimate $s = (s_1, \ldots, s_n)$ on the basis of the observation of X. This
statistical setting is a particular case of our general one described in Section 1
by taking k = 1, $\mathcal{X} = \{1, \ldots, n\}$, $\mathcal{A} = \mathcal{P}(\mathcal{X})$, $\mu$ the counting measure on
$(\mathcal{X}, \mathcal{A})$, $\mathcal{C}_0 = \mathcal{C}$ and N the measure defined for $A \subset \mathcal{X}$ by

$N(A) = \sum_{i \in A} X_i$.

Among the distributions we have in mind for the $X_i$, we mention the Binomial
or Gamma.

For partitions m, m' of $\mathcal{X}$, we set

$\chi^2(m) = \sum_{I \in m} \left( \sqrt{N(I)} - \sqrt{\mathbb{E}[N(I)]} \right)^2$

and

$m \vee m' = \{I \cap I', \ (I, I') \in m \times m'\}$.



4.1. The assumptions. We assume that N satisfies

Assumption 4. There exists a positive number a such that for all $\xi > 0$
and all partitions m of $\mathcal{X}$,

(19) $\mathbb{P}\left[\chi^2(m) \ge a(|m| + \xi)\right] \le e^{-\xi}$.

Besides, we assume that the family of partitions $\mathcal{M}$ satisfies the following.

Assumption 5. There exists $\delta \ge 1$ such that $|m \vee m'| \le \delta (|m| \vee |m'|)$ for
all $m, m' \in \mathcal{M}$.

These two assumptions also appeared in Baraud and Birgé (2009) as Assumptions
H and H' in their Theorem 6. In particular, the following result is
proven there.

Proposition 4. Assumption 4 holds with a = 200/n in the case of Exam- 
ple 1, with a = 6 in the case of Example 3 and, in the case of Example 4, 
with 



+ / 



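provided that for some $\beta \ge 0$ and $\kappa > 0$, the $X_i$ satisfy for $i = 1, \ldots, n$

$\mathbb{E}\left[e^{u(X_i - s_i)}\right] \le \exp\left[ \frac{\kappa u^2 s_i}{2(1 - u\beta)} \right]$ for all $u \in [0, 1/\beta)$

(with the convention $1/\beta = +\infty$ if $\beta = 0$), and

$\mathbb{E}\left[e^{-u(X_i - s_i)}\right] \le \exp\left[ \frac{\kappa u^2 s_i}{2} \right]$ for all $u \ge 0$.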



Throughout this section, we set r = 20az 



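4.2. The main result.

Theorem 3. Assume that Assumptions 4 and 5 hold and that $\{\Delta_m, m \in \mathcal{M}\}$
satisfies (9). Consider a family $\{\hat{s}_\lambda, \lambda \in \Lambda\}$ of estimators of s with values
in $\mathbb{S}$. If pen is such that

$\mathrm{pen}(t) \ge z \tau \inf_{m \in \mathcal{M}(t)} (\delta |m| + \Delta_m)$ $\forall t \in \mathbb{S}$,

the estimator $\tilde{s} = \hat{s}_{\hat{\lambda}}$ selected by Procedure 1 satisfies, for some constant C
depending on z only,

$C \, \mathbb{E}[H^2(s, \tilde{s})] \le \mathbb{E}\left[ \inf_{\lambda \in \Lambda} \left[H^2(s, \hat{s}_\lambda) + \mathrm{pen}(\hat{s}_\lambda)\right] \right] + \tau (\Sigma^2 \vee 1)$.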

The above result holds for any choice of estimators $\{\hat{s}_\lambda, \lambda \in \Lambda\}$ with values
in $\mathbb{S}$. Of special interest are the estimators $\hat{s}_m$ associated to a partition m
of $\mathcal{X}$ by the formula

(20) $\hat{s}_m = \sum_{I \in m} \frac{N(I)}{\mu(I)} 1_I$.

Since when $\mu(I) = 0$, $\mathbb{E}(N(I)) = \int_I s \, d\mu = 0$ and $N(I) = 0$ a.s., the estimator
$\hat{s}_m$ is well-defined with the conventions 0/0 = 0 and c/∞ = 0 for all c > 0.
One can prove (we refer to Baraud and Birgé (2009)) that for all $m \in \mathcal{M}$,

$\mathbb{E}[H^2(s, \hat{s}_m)] \le 4 \left(H^2(s, S_m) + \tau |m|\right)$.
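
As an illustration, the histogram-type estimator attached to a partition can be computed as follows. This is a minimal sketch that takes the estimator to be the piecewise-constant function equal to N(I)/μ(I) on each cell I (the form suggested by the conventions 0/0 = 0 and c/∞ = 0 stated above) and assumes a finite space such as that of Example 4; the function name and arguments are illustrative.

```python
def histogram_estimator(weights, partition, mu=None):
    """Piecewise-constant estimator: on each cell I, the value N(I) / mu(I).

    weights[x] : mass N({x}) given by the observed measure to the point x
    partition  : list of cells, each a list of point indices
    mu         : reference-measure masses mu({x}); counting measure if None
    """
    size = len(weights)
    if mu is None:
        mu = [1.0] * size
    estimate = [0.0] * size
    for cell in partition:
        n_cell = sum(weights[x] for x in cell)
        mu_cell = sum(mu[x] for x in cell)
        # Convention 0/0 = 0: a cell with zero reference mass gets the value 0.
        value = n_cell / mu_cell if mu_cell > 0 else 0.0
        for x in cell:
            estimate[x] = value
    return estimate


observed = [1.0, 3.0, 0.0, 2.0]   # e.g. X_1, ..., X_4 in the setting of Example 4
cells = [[0, 1], [2, 3]]
estimate = histogram_estimator(observed, cells)
print(estimate)  # [2.0, 2.0, 1.0, 1.0]
```

With the counting measure, the estimator simply averages the observations within each cell, so the total mass of the observed measure is preserved.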

In the following sections, we shall apply Theorem 3 in order to choose among 
a family of such estimators. 
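
As a minimal sketch, assuming that formula (20) is the usual histogram rule (the value of $\hat s_m$ on a cell $I$ of $m$ is $N(I)/\mu(I)$), the conventions $0/0 = 0$ and $c/\infty = 0$ can be coded as follows. The function name and the list-based inputs are illustrative choices and do not come from the paper.

```python
import math


def histogram_estimator(counts, measures):
    """Value of the histogram estimator on each cell of a partition.

    counts[i]   plays the role of N(I_i), the mass the observed measure puts on cell i;
    measures[i] plays the role of mu(I_i), which may be 0 or infinite.
    Assumption: the estimator equals N(I)/mu(I) on I, with 0/0 = 0 and c/inf = 0.
    """
    values = []
    for n_cell, mu_cell in zip(counts, measures):
        if mu_cell == 0 or math.isinf(mu_cell):
            # mu(I) = 0 forces N(I) = 0 a.s. (convention 0/0 = 0); c/inf = 0.
            values.append(0.0)
        else:
            values.append(n_cell / mu_cell)
    return values
```

For instance, a cell of measure 0.5 carrying 3 points gets the value 6, while null or infinite cells get the value 0.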

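4.3. Model selection. Let $\mathcal{M}$ be a family of partitions of $X$ and associate to each $m \in \mathcal{M}$ the estimator $\hat s_m$ defined by (20). We deduce from Theorem 3 the following corollary.

Corollary 4. Assume that Assumptions 4 and 5 hold and that $\{\Delta_m, m \in \mathcal{M}\}$ satisfies (9). Choose $\tilde s = \hat s_{\tilde m}$ among $\{\hat s_m, m \in \mathcal{M}\}$ by using Procedure 1 and

$\mathrm{pen}(\hat s_m) = z\tau\left(\delta|m| + \Delta_m\right) \quad \forall m \in \mathcal{M}$.

Then there exists a constant $C$ depending on $z$, $\delta$ and $\Sigma$ only such that

$C\,\mathbb{E}\left[H^2(s, \tilde s)\right] \le \inf_{m \in \mathcal{M}}\left\{\mathbb{E}\left[H^2(s, \hat s_m)\right] + \tau(|m| \vee \Delta_m)\right\}$

$\le \inf_{m \in \mathcal{M}}\left[H^2(s, S_m) + \tau(|m| \vee \Delta_m)\right]$.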

This corollary recovers the results of Theorem 6 in Baraud and Birge (2009) 
even though the selection procedure is different. The choice of a suitable 
family M of partitions is of course a crucial point. It should be chosen in 
such a way that the family {S m , m G M.} possesses good approximation 
properties with respect to classes of functions s of interest. This point has 
been discussed in Baraud and Birge (2009) (see their Section 3). Another 
concern is the computational cost. In the case of density estimation, alterna- 
tive selection procedures based on the minimization of a penalized criterion 
over families M generated by an algorithm such as CART (or some related 
version) can be less time consuming. We refer for example to Blanchard et al. (2004), who consider families of partitions associated to some dyadic decision trees. Their algorithm is inspired by that of Donoho (1997) in the context of regression in 2D.

In view of reducing the computational cost of our selection procedure, we extend Corollary 4 to the case where the partitions $m$ are possibly random, generated from the data themselves.

4.4. Selecting among model selection strategies. Assume now that each $\lambda \in \Lambda$ is a model selection strategy allowing one to choose a partition $\hat m(\lambda)$ among a collection of candidate partitions $\mathcal{M}$. Besides, to each $\lambda \in \Lambda$, associate the estimator $\hat s_\lambda = \hat s_{\hat m(\lambda)}$ with $\hat s_m$ defined by (20) for all $m \in \mathcal{M}$.

By applying Theorem 3 to the collection $\{\hat s_\lambda, \lambda \in \Lambda\}$ we get the following result.

Corollary 5. Assume that Assumptions 4 and 5 hold and that $\{\Delta_m, m \in \mathcal{M}\}$ satisfies (9). Choose $\tilde s = \hat s_{\tilde\lambda}$ among $\{\hat s_\lambda, \lambda \in \Lambda\}$ by using Procedure 1 and

$\mathrm{pen}(\hat s_\lambda) = z\tau\left(\delta|\hat m(\lambda)| + \Delta_{\hat m(\lambda)}\right) \quad \forall \lambda \in \Lambda$.

Then, for some constant $C$ depending on $z$, $\delta$ and $\Sigma$ only,

$C\,\mathbb{E}\left[H^2(s, \tilde s)\right] \le \inf_{\lambda \in \Lambda} \mathbb{E}\left[H^2(s, \hat s_\lambda) + \tau\left(|\hat m(\lambda)| \vee \Delta_{\hat m(\lambda)}\right)\right]$.

Note that Corollary 5 shows that the risk of $\tilde s$ can be related to those of the $\hat s_\lambda$ but gives no hint on the orders of magnitude of the latter. Such a study is beyond the scope of this paper. In density estimation, Lugosi and Nobel (1996) tackled this problem by giving sufficient conditions on the random partition $\hat m(\lambda)$ to ensure the $L_1$-consistency of the estimator $\hat s_\lambda$, that is, under suitable conditions, they show that

$\int_X |s - \hat s_\lambda|\,d\mu \to 0$ a.s.

as the sample size tends to infinity. Since

$H^2(s, \hat s_\lambda) \le \int_X |s - \hat s_\lambda|\,d\mu$,

the same holds for the distance $H$ and by dominated convergence, we deduce that $\mathbb{E}\left[H^2(s, \hat s_\lambda)\right]$ also tends to 0 as the sample size tends to infinity.

We end this section by giving a simple way of choosing a family of partitions from the data by means of a contrast. We shall assume for simplicity that $X = [0, 1)$ and consider the family $\mathcal{M}$ of partitions of $[0, 1)$ into intervals of the form $[a, b)$ the endpoints of which belong to the regular grid $\{k/N, k = 0, \ldots, N\}$ with $N \ge 2$. For such a family, it is easy to check that the choice $\Delta_m = |m| \log(N - 1)$ ensures that (9) holds with $\Sigma \le e$. In what follows, the notation $m \preceq m'$ for $m, m' \in \mathcal{M}$ means that the partition $m'$ is finer than $m$ or equivalently that $S_m \subset S_{m'}$. Let us now introduce the criterion $\mathrm{crit}(N, t)$ defined for $t \in \mathbb{S}$ by

$\mathrm{crit}(N, t) = -2\int_X t\,dN + \int_X t^2\,d\mu$.

It is well-known that $\mathrm{crit}(N, \cdot)$ is a contrast on $\mathbb{S}$ and that if $s$ belongs to $L^2([0, 1), \mu)$, for all $t, t' \in \mathbb{S}$

(21)  $\mathbb{E}\left[\mathrm{crit}(N, t) - \mathrm{crit}(N, t')\right] = \int_X (s - t)^2\,d\mu - \int_X (s - t')^2\,d\mu$.
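
The claim that the weights $|m|\log(N-1)$ give $\Sigma \le e$ can be checked numerically. The sketch below assumes that condition (9) reads $\sum_{m} e^{-\Delta_m} \le \Sigma$, and uses the fact that a partition of $[0,1)$ into $D$ intervals with endpoints on the grid is determined by $D - 1$ interior breakpoints chosen among $N - 1$, so that there are $\binom{N-1}{D-1}$ of them. The function name and the argument `grid_size` (standing for the grid parameter $N$) are hypothetical.

```python
from math import comb, e, exp, log


def weight_sum(grid_size):
    """Sum of exp(-Delta_m) over all grid partitions m, with Delta_m = |m| log(grid_size - 1).

    Partitions are grouped by their number D of intervals: there are
    C(grid_size - 1, D - 1) partitions with |m| = D.
    """
    return sum(
        comb(grid_size - 1, d - 1) * exp(-d * log(grid_size - 1))
        for d in range(1, grid_size + 1)
    )


# The sum equals (1 + 1/(grid_size - 1))**(grid_size - 1) / (grid_size - 1),
# which never exceeds e.
assert all(weight_sum(k) <= e for k in range(2, 60))
```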






YANNICK BARAUD 



Then, given a partition $m \in \mathcal{M}$ it is natural to associate to $S_m$ the estimator obtained by minimizing $\mathrm{crit}(N, t)$ among those $t$ in $S_m$. It turns out that such a minimizer is actually given by $\hat s_m$. Since $|\mathcal{M}| = 2^{N-1}$ is large for large values of $N$, we shall not consider the whole family of estimators $\{\hat s_m, m \in \mathcal{M}\}$, over which our selection procedure could be practically useless, and rather focus on the (random) subfamily defined as follows. Let $\Lambda = \{1, \ldots, N\}$ and define $\hat m(1)$ as the partition of $[0, 1)$ reduced to $\{[0, 1)\}$. Then for $\lambda \ge 2$, define by induction $\hat m(\lambda)$ as the random partition minimizing $\mathrm{crit}(N, \hat s_m)$ among those $m \in \mathcal{M}$ satisfying both $\hat m(\lambda - 1) \preceq m$ and $|m| = \lambda$ (in case of equality take one at random among the minimizers). Since for all $\lambda \in \Lambda$, $S_{\hat m(\lambda - 1)} \subset S_{\hat m(\lambda)}$, note that the map $\lambda \mapsto \mathrm{crit}(N, \hat s_{\hat m(\lambda)})$ is decreasing with $\lambda$ and that $\hat m(N)$ corresponds to the regular partition based on the grid $\{k/N, k = 0, \ldots, N\}$. Finally, set for $\lambda \in \Lambda$, $\hat s_\lambda = \hat s_{\hat m(\lambda)}$. For such a family, our procedure requires at most $N^2$ steps to obtain the family of partitions (for each value $\lambda$, finding $\hat m(\lambda)$ requires at most $N$ computations) and at most $N^2$ additional steps are required to carry out the pairwise comparison of the estimators $\hat s_\lambda$ to finally get $\tilde s = \hat s_{\tilde\lambda}$. Consequently, the whole procedure requires of order $N^2$ steps and it follows from Corollary 5 that $\tilde s$ satisfies

$C\,\mathbb{E}\left[H^2(s, \tilde s)\right] \le \inf_{\lambda \in \{1, \ldots, N\}}\left\{\mathbb{E}\left[H^2(s, \hat s_\lambda)\right] + \tau\lambda\log(N - 1)\right\}$.
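
The nested construction of the partitions can be sketched as follows. This is only an illustration under explicit assumptions: $\mu$ is the Lebesgue measure on $[0,1)$, the estimator on a partition is the histogram $N(I)/\mu(I)$ (so that its contrast reduces to $-\sum_I N(I)^2/\mu(I)$), the data are summarized by the counts of the observed measure on the grid cells, and ties are broken by taking the first minimizer rather than a random one. The function names are hypothetical, and this naive version recomputes the contrast from scratch, so it does not achieve the operation count discussed in the text.

```python
def contrast(counts, breaks):
    """Contrast of the histogram on the partition with cells
    [breaks[i]/K, breaks[i+1]/K), where K = len(counts).

    For a histogram, -2 * int(t dN) + int(t^2 dmu) = -sum_I N(I)^2 / mu(I).
    """
    grid_size = len(counts)
    value = 0.0
    for left, right in zip(breaks[:-1], breaks[1:]):
        n_cell = sum(counts[left:right])        # N(I)
        mu_cell = (right - left) / grid_size    # Lebesgue measure of I
        value -= n_cell ** 2 / mu_cell
    return value


def greedy_partitions(counts):
    """Nested partitions with 1, 2, ..., K cells, each obtained by adding to the
    previous one the grid breakpoint that minimizes the contrast."""
    grid_size = len(counts)
    breaks = [0, grid_size]            # trivial partition {[0, 1)}
    path = [breaks]
    for _ in range(grid_size - 1):
        free = [k for k in range(1, grid_size) if k not in breaks]
        best = min(free, key=lambda k: contrast(counts, sorted(breaks + [k])))
        breaks = sorted(breaks + [best])
        path.append(breaks)
    return path
```

With counts `[5, 5, 0, 0]` on a grid of four cells, the first split is placed at the middle of $[0,1)$, which separates the cells carrying mass from the empty ones.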

5. Selecting among points 

We assume here that the estimators $\hat s_\lambda$ are deterministic. In order to emphasize the fact that they do not depend on $N$, these will be denoted $s_\lambda$ hereafter. The aim of this section is to show that our selection procedure allows one to select among arbitrary points in $\mathcal{C}_0$ and also provides an alternative to the procedure based on testing proposed in Birgé (2006) for the construction of T-estimators. The proofs of the following propositions are deferred to Section 8.6.

5.1. Aggregation of arbitrary points. Let $\{s_\lambda, \lambda \in \Lambda\}$ be a countable family of arbitrary points of $\mathcal{C}_0$. Typically, one should think of the $s_\lambda$ as estimators of $s$ based on an independent copy $N'$ of $N$. In this case, with no loss of generality we may assume that $\Lambda = \mathcal{M}$ and $S_m = \{s_m\}$ for all $m \in \mathcal{M}$. Then, the following result should be understood as conditional on $N'$.

Proposition 5. Assume that Assumption 2 holds, set $\tau = 4(2 + cz)/(az^2)$, and take $\{\Delta_m, m \in \mathcal{M}\}$ satisfying (9). Choose $\tilde s = s_{\tilde m}$ among $\{s_m, m \in \mathcal{M}\}$ according to Procedure 1 with

$\mathrm{pen}(s_m) = z\tau\Delta_m, \quad \forall m \in \mathcal{M}$.



ESTIMATOR SELECTION 






Then,

$\mathbb{E}\left[H^2(s, \tilde s)\right] \le C \inf_{m \in \mathcal{M}}\left[H^2(s, s_m) + \tau\Delta_m\right]$

where $C$ depends on $z$, $b$, $M$ and $\Sigma$ only.

Our procedure also makes it possible to handle the problem of convex aggregation from i.i.d. observations in the same way as Birgé did in Section 9 of Birgé (2006). We shall not detail this in the present paper and rather refer to the paper by Birgé for examples and references.

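5.2. Selecting among discretized subsets of $\mathcal{C}_0$. For each $m \in \mathcal{M}$, let $\mathbb{S}_m = \{s_\lambda, \lambda \in \Lambda(m)\}$ be a discrete subset of $\mathcal{C}_0$. Taking $\Lambda = \bigcup_{m \in \mathcal{M}} \Lambda(m)$, we consider the family $\{s_\lambda, \lambda \in \Lambda\}$ obtained by gathering all these discretization points. The following holds.

Proposition 6. Let $M$ be a positive number and $\{\Delta_m, m \in \mathcal{M}\}$ a family of nonnegative numbers satisfying (9). Assume that Assumptions 2 and 3 hold with $\tau = 4(2 + cz)/(az^2)$. By applying Procedure 1 with the family of estimators $\{s_\lambda, \lambda \in \Lambda\}$ and

$\mathrm{pen}(t) = z\tau \inf\left\{4\eta_m^2 + \Delta_m,\ m \in \mathcal{M}(t)\right\}, \quad \forall t \in \mathbb{S}$

the estimator $\tilde s$ satisfies

(22)  $\mathbb{E}\left[H^2(s, \tilde s)\right] \le C \inf_{m \in \mathcal{M}}\left[H^2(s, \mathbb{S}_m) + \tau(\eta_m^2 \vee \Delta_m)\right]$,

where $C$ depends on $z$, $b$, $M$ and $\Sigma$ only.

If moreover $\mathbb{S}_m$ is a $(M, \eta_m\sqrt{\tau}, D_m)$-net for $S_m$ with $\eta_m^2 = 2(D_m \vee 1/8)$ for all $m \in \mathcal{M}$, then,

(23)  $\mathbb{E}\left[H^2(s, \tilde s)\right] \le C \inf_{m \in \mathcal{M}}\left[H^2(s, S_m) + \tau(D_m \vee \Delta_m)\right]$,

where $C$ depends on $z$, $b$, $M$ and $\Sigma$ only.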

In density estimation, an inequality such as (23) also holds for T-estimators as proven in Birgé (2006) (see his Theorem 5). For suitable choices of collections $\{S_m, m \in \mathcal{M}\}$, an estimator $\tilde s$ satisfying (23) possesses nice optimality properties (in the minimax sense) and outperforms in some situations the classical maximum likelihood estimator. For more details, we refer the reader to the paper of Birgé mentioned above.

Assume now that for all (deterministic) $m \in \mathcal{M}$, one is able to build an estimator $\hat s_m$ (depending on $N$) with values in $S_m$ with a risk satisfying for some universal constant $C$,

(24)  $\mathbb{E}\left[H^2(s, \hat s_m)\right] \le C\left(H^2(s, S_m) + D_m\right), \quad \forall s \in \mathcal{C}$.

By selecting among the family $\{\hat s_m, m \in \mathcal{M}\}$ with Procedure 2, one obtains an estimator $\tilde s' = \hat s_{\tilde m}$ which also satisfies an inequality such as (23) (this easily derives from (18)). Consequently, from a theoretical point of view both estimators $\tilde s$ and $\tilde s'$ possess similar properties. If the estimators $\hat s_m$ can be built in a simple way, the advantage of $\tilde s'$ compared to $\tilde s$ is rather practical since the former requires the pairwise comparison of the estimators $\hat s_m$ only, whereas the latter requires that of all the pairs of $s_\lambda$. This shows that the discretization device is actually useful only when no estimator $\hat s_m$ satisfying (24) is available. This seems to be often the case when the models $S_m$ are not linear spaces or when the maximum likelihood estimator performs poorly.

Finally, we mention that a careful look at the proof of Theorem 5 in Birge (2006) 
shows that the selection rule described there could also be used to select 
among the estimators s m in the sense that the resulting estimator would 
also satisfy an analogue of (23). 

6. Estimating the means of nonnegative random variables 

In this section, we consider the statistical setting described in Example 4. Hereafter, we shall assume that $\sqrt{s}$ belongs to some closed convex subset $C$ of $\mathbb{R}^n$. Since the distance $H$ between two elements $t, t' \in \mathbb{R}_+^n$ corresponds to the Euclidean distance between $\sqrt{t}$ and $\sqrt{t'}$, it seems natural to approximate the parameter $\sqrt{s}$ with respect to the Euclidean norm. To do so, we introduce a family of linear subspaces $\{\overline{V}_m, m \in \mathcal{M}\}$ of $\mathbb{R}^n$ with respective dimensions denoted $D_m$ that correspond to approximation spaces for $\sqrt{s}$. We associate to each of these the sets $V_m$ for $m \in \mathcal{M}$ which are either given by $V_m = \overline{V}_m \cap C$ or $V_m = \Pi_C \overline{V}_m$. Finally, we consider the models $S_m$ defined for $m \in \mathcal{M}$ by

$S_m = \phi^{-1}(V_m) = \left\{(u_1^2, \ldots, u_n^2),\ u \in V_m\right\}$

where $\phi(t) = \sqrt{t}$ for $t \in \mathbb{R}_+^n$.

Two examples of collections {V m , m £ M} are given below. 

Problem 1 (The regression problem). Assume that $\sqrt{s} = (F(x_1), \ldots, F(x_n))$ where the $x_i$ are deterministic points on $[0, 1]$ and $F$ is a function from $[0, 1]$ into $\mathbb{R}_+$. Note that the problem we deal with can be written in a regression setting as follows

$X_i = F^2(x_i) + \varepsilon_i, \quad i = 1, \ldots, n$

where the $\varepsilon_i = X_i - F^2(x_i)$ are independent and centered random variables. The problem is to estimate $s = (F^2(x_1), \ldots, F^2(x_n))$.

In order to approximate $\sqrt{s} = (F(x_1), \ldots, F(x_n))$, it is natural to introduce linear spaces $\{\overline{V}_m, m \in \mathcal{M}\}$ having good approximation properties with respect to usual classes of functions $F$ such as Besov spaces. For $\alpha > 0$ and $p \in [1, +\infty]$, $B^\alpha_{p,\infty}(R)$ denotes the ball of radius $R > 0$ of the Besov space $B^\alpha_{p,\infty}$. For a precise definition of these spaces, we refer to DeVore and Lorentz (1993). The following result derives from Theorem 1 and Proposition 1 in Birgé and Massart (2000).

Proposition 7. For all $r \in \mathbb{N} \setminus \{0\}$ and $J \in \mathbb{N}$, there exists a family $\{\mathbb{V}_{(m,r)}, m \in \mathcal{M}_r(J)\}$ of linear subspaces of $L^2([0, 1], dx)$ and positive numbers $C(r)$, $C'(r)$, $C''(r)$ such that $D_{(m,r)} = \dim(\mathbb{V}_{(m,r)}) \le C(r)2^J$, $\log|\mathcal{M}_r(J)| \le C'(r)2^J$ and for all $\alpha \in (1/p, r)$ and all $f \in B^\alpha_{p,\infty}(R)$,

$\inf\left\{\sup_{x \in [0,1]} |f(x) - g(x)|,\ g \in \bigcup_{m \in \mathcal{M}_r(J)} \mathbb{V}_{(m,r)}\right\} \le C''(r)\,R\,2^{-J\alpha}$.

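Thus, for handling Problem 1 we shall consider $\mathcal{M} = \bigcup_{r \ge 1}\bigcup_{J \ge 0} \mathcal{M}_r(J)$, and for all $m = (m, r) \in \mathcal{M}$, take

$\overline{V}_m = \left\{(g(x_1), \ldots, g(x_n)),\ g \in \mathbb{V}_{(m,r)}\right\}$, $V_m = \Pi_C \overline{V}_m$ and $S_m = \phi^{-1}(V_m)$.

Besides, by taking for $m = (m, r) \in \mathcal{M}_r(J)$, $\Delta_m = (C'(r) + 1)2^J + r$, note that (9) holds since

$\sum_{m \in \mathcal{M}} e^{-\Delta_m} \le \sum_{r \ge 1}\sum_{J \ge 0} |\mathcal{M}_r(J)|\,e^{-(C'(r)+1)2^J - r} \le \sum_{r \ge 1} e^{-r} \sum_{J \ge 0} e^{-2^J} < +\infty$.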

Let us now turn to another problem. 

Problem 2 (The variable selection problem). We assume that $\sqrt{s}$ is of the form

$\sqrt{s} = \sum_{j=1}^{p} \beta_j v^{(j)}$

where $\beta = (\beta_1, \ldots, \beta_p)$ is an unknown vector of $\mathbb{R}^p$ and $v^{(1)}, \ldots, v^{(p)}$ are $p \ge 2$ known vectors in $\mathbb{R}^n$. This means that the square root of the mean of each $X_i$ is a linear combination of the values $v_i^{(j)}$ of the predictors $v^{(j)}$ for $j = 1, \ldots, p$ at experiment $i$. Since the number of predictors $p$ may be large and possibly larger than the number $n$ of data, we shall assume that the vector $\beta$ is sparse, which means that

$|\{j,\ \beta_j \ne 0\}| \le D_{\max}$

for a known integer $D_{\max} \le n$. Our aim is to estimate $\sqrt{s}$ and the set $\{j,\ \beta_j \ne 0\}$.

For this problem, we consider any class $\mathcal{M}$ of subsets $m$ of $\{1, \ldots, p\}$ with cardinality not larger than $D_{\max}$, and define for $m \in \mathcal{M}$, $V_m = \overline{V}_m \cap C$ where $\overline{V}_m$ is the linear span of the $v^{(j)}$ for $j \in m$ (with the convention $\overline{V}_\emptyset = \{0\}$).
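
To make the class of candidate models concrete, the sketch below enumerates all subsets of $\{1, \ldots, p\}$ of cardinality at most $D_{\max}$ and attaches a weight to each. The weights $\log\binom{p}{|m|} + |m|$ are a hypothetical choice made for this illustration only (the text does not prescribe them); they are picked so that $\sum_m e^{-\Delta_m} = \sum_{d \le D_{\max}} e^{-d}$ stays bounded, assuming condition (9) asks for a finite sum of this type.

```python
from itertools import combinations
from math import comb, exp, log


def candidate_models(p, d_max):
    """All subsets m of {1, ..., p} with |m| <= d_max (the empty set included)."""
    return [m for d in range(d_max + 1) for m in combinations(range(1, p + 1), d)]


def weight(m, p):
    """Hypothetical weight Delta_m = log C(p, |m|) + |m| (not taken from the paper)."""
    return log(comb(p, len(m))) + len(m)


models = candidate_models(p=6, d_max=3)
# Each cardinality d contributes C(p, d) * exp(-log C(p, d) - d) = exp(-d).
total = sum(exp(-weight(m, 6)) for m in models)
```

The number of candidate models grows quickly with $p$ and $D_{\max}$, which is why the computational cost of an exhaustive comparison matters in practice.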

6.1. Assumption on the $X_i$. We assume the following

Assumption 6. The random variables $X_i$ are independent nonnegative random variables with respective means $s_i$ satisfying for some nonnegative numbers $\sigma$ and $\beta$ and all $i = 1, \ldots, n$,

(25)  $\mathbb{E}\left[e^{u(X_i - s_i)}\right] \le \exp\left[\frac{u^2 \sigma s_i}{2(1 - |u|\beta)}\right] \quad \forall u \in (-1/\beta, 1/\beta)$.

This assumption holds for a large class of distributions including any random variables with values in $[0, \beta]$ (then $\sigma = \beta$), the Binomial distribution (then $\sigma = 1 = \beta$), the Poisson distribution (for the same choice of parameters), or the Gamma distribution $\gamma(p, q)$ (with mean $p/q$ and $\beta = 1/q = \sigma$). By expanding (25) in a vicinity of 0, it is easy to see that Assumption 6 implies that $\mathrm{Var}(X_i) \le \sigma\,\mathbb{E}(X_i)$ for all $i = 1, \ldots, n$.
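
The Poisson case can be checked numerically. The sketch below reads (25) as a bound on the log-moment generating function, $\log \mathbb{E}[e^{u(X - s)}] \le u^2\sigma s / (2(1 - |u|\beta))$, and uses the classical identity $\log \mathbb{E}[e^{u(X - s)}] = s(e^u - 1 - u)$ for $X$ Poisson with mean $s$. The function names and the grid of values of $u$ are illustrative choices.

```python
import math


def poisson_log_mgf(u, s):
    """log E[exp(u (X - s))] for X Poisson with mean s."""
    return s * (math.exp(u) - 1.0 - u)


def bernstein_exponent(u, s, sigma=1.0, beta=1.0):
    """Exponent on the right-hand side of (25), valid for |u| < 1/beta."""
    return u * u * sigma * s / (2.0 * (1.0 - abs(u) * beta))


def check_poisson(s, grid_size=199):
    """Check the bound with sigma = beta = 1 on a grid of u in (-1, 1)."""
    us = [-0.995 + 1.99 * k / (grid_size - 1) for k in range(grid_size)]
    return all(
        poisson_log_mgf(u, s) <= bernstein_exponent(u, s) + 1e-12 for u in us
    )
```

A grid check is of course not a proof; it only illustrates that the stated parameters are compatible with the Poisson log-moment generating function.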

In the remaining part of this section, under Assumption 6, we shall set

(26)  $\tau = \frac{96(\sigma + \beta)}{z^2}$.

6.2. Discretizing the $S_m$. To each $m \in \mathcal{M}$ such that $S_m \ne \{0\}$, we apply with $\overline{V} = \overline{V}_m$, $V = V_m$ and $S = S_m$ one of the two discretization procedures described below (according to the form of $V_m$). These procedures lead to a discretized subset $\mathbb{S}_m$ of $S_m$ associated to a parameter $\eta = \eta_m$ depending on the dimension of $\overline{V}_m$.

The first procedure below is abstract and is based on a discretization argu- 
ment introduced in Birge (2006). The resulting set S, though difficult to 
build in practice, possesses nice properties with respect to the original set 
S. We shall not detail the construction of S here and rather refer the reader 
to the proof in Section 8.8. We only present its properties. We shall use 
them in order to obtain new results on the estimation of the parameter s. 

Discretization P1. We assume here that $S = \phi^{-1}(V)$ where $V$ is of the form $\Pi_C \overline{V}$ for some linear subspace $\overline{V}$ of $\mathbb{R}^n$ with dimension $D \ge 1$. We associate to $S$ the parameter

(27)  $\eta^2 = 2 \times 4.2D$

together with a discretized subset $\mathbb{S}$ with the following properties.

Proposition 8. There exists a discretized subset $\mathbb{S}$ of $S$ which satisfies Assumption 3 with $M = 1$ and $\tau$ and $\eta$ given by (26) and (27) respectively. Moreover, $H(t, \mathbb{S}) \le 4H(t, S)$ for all $t \in \mathcal{C}$.

The procedure below is much simpler than the one above but unfortunately not as powerful. Yet, it turns out to be enough to handle Problem 2.

Discretization P2. We assume here that $S = \phi^{-1}(V)$ where $V$ is of the form $\overline{V} \cap C$ and $\overline{V}$ is a linear subspace of $\mathbb{R}^n$ with dimension $D \ge 1$. Let $\Pi_V$ be the projector onto the closed convex set $V$ and $T$ the subset of $\overline{V}$ given by

(28) r = I ^ J2 k ^ ^, 1 <> G 

where $\{u_1, \ldots, u_D\}$ is an orthonormal basis of $\overline{V}$ and

(29)  $\eta^2 = 2 \times 1.031D$.

Keep only the elements of $T$ which are at distance not larger than $\eta\sqrt{\tau}$ of $V$, that is, those of

$T(\eta) = \left\{t \in T,\ \inf_{v \in V} \|t - v\| \le \eta\sqrt{\tau}\right\}$

and define finally

$\mathbb{S} = \phi^{-1}(\mathbb{T})$ where $\mathbb{T} = \left\{\Pi_V t,\ t \in T(\eta)\right\}$.

The subset $\mathbb{S} \subset S$ satisfies the following.

Proposition 9. The subset $\mathbb{S}$ is a $(1, \eta\sqrt{\tau}, 1.031D)$-net of $S$ with $\tau$ and $\eta$ given by (26) and (29) respectively.

The proof is deferred to Section 8.9.

6.3. The results. We have at our disposal the family of discretized subsets $\mathbb{S}_m$ of $S_m$ which have been built in the previous section. We recall that each of these $\mathbb{S}_m$ is associated to a parameter $\eta_m > 0$. We consider here the discretization points $\{s_\lambda, \lambda \in \Lambda(m)\} = \mathbb{S}_m$ for $m \in \mathcal{M}$ and the family of estimators $\{s_\lambda, \lambda \in \Lambda = \bigcup_{m \in \mathcal{M}} \Lambda(m)\}$ obtained by gathering those. For such a family, the following holds:

Theorem 4. Assume that Assumption 6 holds and let $\{\Delta_m, m \in \mathcal{M}\}$ be a family of weights satisfying (9). Choose $\tilde s = s_{\tilde\lambda}$ among the family $\{s_\lambda, \lambda \in \Lambda\}$ according to Procedure 1 with pen satisfying

(30)  $\mathrm{pen}(t) = z\tau \inf_{m \in \mathcal{M}(t)}\left(4\eta_m^2 + \Delta_m\right) \quad \forall t \in \mathbb{S}$.

Then,

$C\,\mathbb{E}\left[H^2(s, \tilde s)\right] \le \inf_{m \in \mathcal{M}}\left\{H^2(s, S_m) + \tau(D_m \vee \Delta_m)\right\}$

where $\tau$ is defined by (26) and $C$ depends on $z$ and $\Sigma$ only.

We deduce from Theorem 4 the following risk bounds:

Corollary 6. Assume Assumption 6 holds. Then,

(i) for any $m \in \mathcal{M}$, there exists an estimator $\hat s_m$ satisfying

(31)  $\sup_{s \in S_m} \mathbb{E}\left[H^2(s, \hat s_m)\right] \le C(D_m \vee 1)$,

where $C$ depends on $z$, $\sigma$ and $\beta$ only;

(ii) for Problem 1, there exists an estimator $\hat F$ such that for all $p \in [1, +\infty]$, $\alpha > 1/p$ and $R \ge 1/n$

$\sup_{F \in B^\alpha_{p,\infty}(R)} \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\left(F(x_i) - \hat F(x_i)\right)^2\right] \le C R^{2/(1+2\alpha)} n^{-2\alpha/(1+2\alpha)}$,

where $C$ depends on $R$, $\alpha$, $p$, $\sigma$, $z$ and $\beta$;

(iii) for Problem 2, by applying Procedure 1 with weights $\Delta_m$ satisfying (9), one selects a family of predictors $\{v^{(j)}, j \in \tilde m\}$ and builds an estimator $\tilde s \in S_{\tilde m}$ such that,

$\mathbb{E}\left[H^2(s, \tilde s)\right] \le C \inf_{m \in \mathcal{M}}\left[H^2(s, S_m) + |m| \vee \Delta_m\right]$,

where $C$ depends on $z$, $\sigma$, $\beta$ only.

To our knowledge, Example 4 has received little attention in the literature, 
especially from a non-asymptotic point of view. The only exceptions we are 
aware of are Antoniadis, Besbeas and Sapatinas (2001) (see also Antoniadis 
and Sapatinas (2001)) and Kolaczyk and Nowak (2004). These papers con- 
sider the case where s is of the form (F(xi), . . . , F(x n )) for some function F 
on [0, 1]. In Antoniadis, Besbeas and Sapatinas (2001), the authors estimate 
F by a wavelet shrinkage procedure and show that the resulting estimator 
achieves the usual estimation rate of convergence over Sobolev classes with 
smoothness indexes larger than 1/2. Kolaczyk and Nowak (2004) study the 
risk properties of some thresholding and partitioning estimators. Their approach requires that the $s_i$ be bounded from above and below by positive numbers. Finally, Baraud and Birgé (2009) tackle this problem but their approach is restricted to the case of histogram-type estimators. In particular, the estimation rates they get hold for $\alpha \le 1$ only.

6.4. Lower bounds. The aim of this section is to show that the upper 
bound (31) gives the right order of magnitude for the minimax rate over 
S m , at least under the following assumptions. 

Assumption 7. The distribution of the random vector $X = (X_1, \ldots, X_n)$ belongs to an exponential family of the form

(32)  $dP_\theta = \exp\left[\sum_{i=1}^{n}\left(\theta_i T(x_i) - A(\theta_i)\right)\right] \bigotimes_{i=1}^{n} d\nu(x_i)$ with $\theta \in \Theta^n$

where $\nu$ denotes some measure on $\mathbb{R}_+$, $T$ is a map from $\mathbb{R}_+$ to $\mathbb{R}$, the $\theta_i$ are parameters belonging to an open interval $\Theta$ such that

$\Theta \subset \left\{a \in \mathbb{R},\ \int \exp\left[aT(x)\right] d\nu(x) < +\infty\right\}$

and $A$ denotes a smooth function from $\Theta$ into $\mathbb{R}$ satisfying $A''(a) \ne 0$ for all $a \in \Theta$.

These families include Poisson, Binomial and Gamma distributions (among others). Besides, it is well known that $A$ is infinitely differentiable on $\Theta$ and under $P_\theta$, the $X_i$ satisfy

$\mathbb{E}[X_i] = A'(\theta_i) = s_i$ and $\mathrm{Var}(X_i) = A''(\theta_i) > 0, \quad \forall i = 1, \ldots, n$.

Therefore, the unknown parameter $\sqrt{s}$ necessarily belongs to the open cube $C = I^n$ where $I$ denotes the interval $\phi(A'(\Theta))$.

We shall also assume that the parameter space $\Theta$ is such that the following holds.

Assumption 8. There exists some $\kappa > 0$ such that for all $\theta \in \Theta^n$, under $P_\theta$

(33)  $0 < \mathbb{E}(X_i) \le \kappa\,\mathrm{Var}(X_i) \quad \forall i = 1, \ldots, n$.

Since $A'$, $A''$ are continuous and positive functions, such an assumption is automatically fulfilled by choosing $\Theta$ such that $\overline{\Theta}$ is compact and $A'$ and $A''$ are positive on $\overline{\Theta}$.

Theorem 5. Let V be a linear subspace of W 1 with dimension D > 1 and 
S = <f>~ 1 (VnC). Define 

K = {r G (0, (2^)~ 1 ), 3u eVD C, {u G F, ||u - u \\ < r) C C} . 

Under Assumptions 7 and 8, 

infsupE s \H 2 (s,s)] > ^-supr 2 , 
s se5 3U re7 ^ 

wit/i the convention sup = 0. 



7. Estimation and variable selection in non-Gaussian regression

In this section, we use the notations of Example 2 and assume that we observe the random variables $X_1, \ldots, X_n$ satisfying

$X_i = f_i + \varepsilon_i, \quad i = 1, \ldots, n$

where $f = (f_1, \ldots, f_n)$ is an unknown vector of $\mathbb{R}^n$ and the $\varepsilon_i$ are i.i.d. random variables with known density $q$ on $\mathbb{R}$. Hereafter, we consider a family of linear subspaces $\{\overline{V}_m, m \in \mathcal{M}\}$ of $\mathbb{R}^n$ with respective dimensions denoted $D_m$ and $\{\hat f_\lambda, \lambda \in \Lambda\}$ a family of estimators of $f$ with values in $\bigcup_{m \in \mathcal{M}} \overline{V}_m$ based on the observation of $X = (X_1, \ldots, X_n)$.

For example, when $f$ is assumed to be of the form $(F(x_1), \ldots, F(x_n))$ for some function $F$ and points $x_1, \ldots, x_n$ in $[0, 1]$ one can use the collection of linear spaces introduced to tackle Problem 1. Alternatively, if one assumes that $f$ is of the form $f = \sum_{j=1}^{p} \beta_j v^{(j)}$ as in Problem 2, one can use the collection of $\overline{V}_m$ defined there to perform variable selection.

As possible estimators, one can associate to each $\overline{V}_m$ the least-squares estimator of $f$ in $\overline{V}_m$ defined as $\hat f_m = \Pi_m X$ where $\Pi_m$ is the orthogonal projector onto $\overline{V}_m$.



It is well-known that

(34)  $\mathbb{E}\left[\|f - \hat f_m\|^2\right] = \|f - \Pi_m f\|^2 + D_m \sigma^2$

where $\sigma^2$ denotes the variance of the $\varepsilon_i$ (provided that it is finite). In the context of variable selection, much effort has been devoted to designing (practical) selection rules among the predictors. Among the most popular procedures, we mention the Lasso and the Dantzig selector described respectively in Tibshirani (1996) and Candès and Tao (2007). Given a family $\Lambda$ of such procedures, an alternative family of estimators for $f$ could be that given by $\{\hat f_{\hat m(\lambda)}, \lambda \in \Lambda\}$ where $\hat m(\lambda)$ corresponds to the family of predictors selected by the procedure $\lambda$ in $\Lambda$.
We shall assume the following

Assumption 9. There exist some known positive numbers $R$, $\underline{R}$, $\overline{R}$ such that $\max_{i=1,\ldots,n} |f_i| \le R$ and for all $r, r' \in [-R, R]$,

(35)  $\underline{R}\,|r - r'| \le h(q_r, q_{r'}) \le \overline{R}\,|r - r'|$

where $q_r(x) = q(x - r)$ for all $x, r \in \mathbb{R}$ and $h$ is the Hellinger distance between the densities $q_r$ and $q_{r'}$.

Throughout, we denote by $C$ the cube $[-R, R]^n$, $q_g = (q_{g_1}, \ldots, q_{g_n})$ for $g \in \mathbb{R}^n$ and $\mathcal{C}_0 = \{q_g, g \in C\}$. Assumption 9 implies that $(\mathcal{C}_0, H)$ is almost isometric to $(C, \|\cdot\|)$.

Assumption 9 holds if $\sqrt{q}$ is regular enough (see Theorem 3A page 183 in Borovkov (1998)). The quantities $\underline{R}$ and $\overline{R}$ then depend on the Fisher information. Let us now turn to some examples.

If for some known $\theta > 0$

$q(x) = \frac{\theta}{2} e^{-\theta|x|}, \quad x \in \mathbb{R}$

then, $h^2(q_r, q_{r'}) = 1 - e^{-\theta|r - r'|/2}\left(1 + \theta|r - r'|/2\right)$ and (35) holds with $\overline{R}^2 = 1/2$ and $\underline{R}^2 = \left(1 - e^{-\theta R}(1 + \theta R)\right)/(\theta^2 R^2)$. Assumption 9 can also be met even though the $\varepsilon_i$ have no finite moments. For example, this is the case for

$q(x) = \frac{1}{2(1 + |x|)^2}, \quad x \in \mathbb{R}$.

Indeed,

$h^2(q_r, q_{r'}) = \psi(|r - r'|)$ with $\psi(x) = 1 - \frac{2(1 + x)\log(1 + x)}{x(2 + x)}$

and since $\psi(x)/x^2$ is decreasing on $\mathbb{R}_+$ and tends to 1/2 when $x$ tends to $0^+$, Inequality (35) holds with $\overline{R}^2 = 1/2$ and $\underline{R}^2 = \psi(2R)/(2R)^2$.
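
The closed form given above for the Hellinger distance between two translates of the Laplace density can be checked by numerical integration. The sketch below assumes the normalization $h^2(p, q) = 1 - \int \sqrt{pq}$ (that is, $h^2 = \frac{1}{2}\int(\sqrt{p} - \sqrt{q})^2$); the function names, the truncation of the real line and the midpoint rule are choices made for this illustration.

```python
import math


def laplace_density(x, theta):
    """Density q(x) = (theta / 2) * exp(-theta * |x|)."""
    return 0.5 * theta * math.exp(-theta * abs(x))


def hellinger_sq_numeric(shift, theta, half_width=40.0, steps=200_000):
    """1 - integral of sqrt(q(x) q(x - shift)) dx, by the midpoint rule on a
    truncated interval (the tails are negligible for moderate theta)."""
    lo, hi = -half_width, half_width + abs(shift)
    h = (hi - lo) / steps
    affinity = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        affinity += math.sqrt(
            laplace_density(x, theta) * laplace_density(x - abs(shift), theta)
        )
    return 1.0 - affinity * h


def hellinger_sq_closed_form(shift, theta):
    """1 - exp(-d) * (1 + d) with d = theta * |shift| / 2."""
    d = theta * abs(shift) / 2.0
    return 1.0 - math.exp(-d) * (1.0 + d)
```

The two functions agree up to the integration error, which supports the expression of $h^2(q_r, q_{r'})$ as a function of $\theta|r - r'|/2$ only.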

7.1. The procedure and the results. Throughout this section, $\tau = 50z^{-2}$. To each estimator $\hat f_\lambda$ with $\lambda \in \Lambda$, we associate the estimator of $q_f$ given by $\hat s_\lambda = q_{\hat f_\lambda}$. Our selection procedure is based on a suitable family of discretized subsets of $\mathcal{C}_0$. Let us introduce two of these.

Collection (C1). For all $m \in \mathcal{M}$, let us set

$S_m = \{q_g,\ g \in \Pi_C \overline{V}_m\}$ and $\mathbb{S}_m = \{q_g,\ g \in T_m\}$

where $T_m$ is the discretized set $T$ obtained by applying Discretization P1 with $\overline{V} = \overline{V}_m$ and



2 —2 



2 x 



3iF 



Collection (C2). For all m G .M, let us set C m = C n V m 

S m = {q g , 9 e C m } and 5 m = {<? 3 , 5 G 7^} 

7^ is the discretized set T obtained by Discretization P2 with V = V r 
and 

\m\D m 



2 —2 
*7 =7 ?m 



2 x 



R l 



We obtain the following result. 

Theorem 6. Let $\{\hat f_\lambda, \lambda \in \Lambda\}$ be an arbitrary (countable) family of estimators with values in $\bigcup_{m \in \mathcal{M}} \overline{V}_m$ and $\{\Delta_m, m \in \mathcal{M}\}$ a family of weights fulfilling (9). Assume that Assumption 9 holds.

By applying Procedure 2 with the family of estimators $\{\hat s_\lambda, \lambda \in \Lambda\}$, the family $\{\mathbb{S}_m, m \in \mathcal{M}\}$ given by Collection C1, and



pen(i) = zt inf < 4 x 



+ A m , m€M(t)> Vt G S, 



2D r 



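one selects from the data some $\tilde\lambda \in \Lambda$ for which the estimator $\tilde f = \hat f_{\tilde\lambda}$ satisfies for some constant $C$ depending on $z$, $\underline{R}$, $\overline{R}$ and $\Sigma$

$C\,\mathbb{E}\left[\|f - \tilde f\|^2\right] \le \inf_{\lambda \in \Lambda}\left\{\mathbb{E}\left[\|f - \hat f_\lambda\|^2\right] + \mathbb{E}\left[D_{\hat m(\lambda)} \vee \Delta_{\hat m(\lambda)}\right]\right\}$.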

By applying Procedure 2 with the family of estimators $\{\hat s_\lambda, \lambda \in \Lambda\}$, the family $\{\mathbb{S}_m, m \in \mathcal{M}\}$ given by Collection C2, and

pen(t) = zTmi {4 x 2AD m + A m , m G M (*)} Vt G S, 

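one selects from the data some $\tilde\lambda \in \Lambda$ for which the estimator $\tilde f = \hat f_{\tilde\lambda}$ satisfies for some constant $C$ depending on $z$, $\underline{R}$, $\overline{R}$ and $\Sigma$

$C\,\mathbb{E}\left[\|f - \tilde f\|^2\right] \le \inf_{\lambda \in \Lambda}\left\{\mathbb{E}\left[\|f - \hat f_\lambda\|^2\right] + \mathbb{E}\left[B(\hat f_\lambda)\right] + \mathbb{E}\left[D_{\hat m(\lambda)} \vee \Delta_{\hat m(\lambda)}\right]\right\}$

where

$B(\hat f_\lambda) = \inf\left\{\|\hat f_\lambda - t\|^2,\ t \in C_{\hat m(\lambda)}\right\}$.

If the family of estimators $\hat f_\lambda$ take their values in $\left(\bigcup_{m \in \mathcal{M}} \overline{V}_m\right) \cap C$ then $B(\hat f_\lambda) = 0$ and the same risk bound for $\tilde f$ is achievable with both Collections C2 and C1.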

The proof of Theorem 6 is postponed to Section 8.13. 

For illustration, we deduce the following corollaries in the context of variable 



selection. Hereafter, we consider the family of linear spaces \ V r 
given in Problem 2. 



m 



G M) 



Corollary 7. For m G Ai, let {f\, A G A(m)} be any countable and dense 
subset of V m . Define 771(A) = m if A G A(tt7) and apply the procedure de- 
scribed in Theorem 6 with the collection Ci and the family of estimators 
{fx, A G Urnex^ 771 )}' Under Assumption 9, one selects a subset of pre- 
dictors jvtf), j G rn(A)} for which the estimator f = f^ G V m fy sa ^ s fi es 



E 



/ 



/III < C mf 



{\\f 



n, 



1 

ff + D VA ' 



where C depends on z,R,R and S. 



Provided that the distribution of the errors is known and the mean / 
bounded by some known constant, this result shows that complete vari- 
able selection is possible even though the errors may not admit any finite 
moments. 

Let us now turn to a result showing how to select among families of least-squares estimators $\{\hat f_m, m \in \mathcal{M}\}$ such as those introduced at the beginning of the section. Hereafter we take $\Lambda = \mathcal{M}$, choose $\hat m(\lambda) = m$ for all $\lambda = m \in \Lambda$ and define $m^*$ as any minimizer of $|m| \vee \Delta_m$ among those $m \in \mathcal{M}$ such that $f \in C_m$.

Corollary 8. Assume that $\sigma < +\infty$ and that Assumption 9 holds true. Let $\{\Delta_m, m \in \mathcal{M}\}$ be a family of weights satisfying (9). Consider the family of least-squares estimators $\{\hat f_m = \Pi_m X, m \in \mathcal{M}\}$ and apply the selection procedure described in Theorem 6 with the collection (C2). The resulting estimator $\tilde f \in \overline{V}_{\tilde m}$ satisfies,



CE 



f-f 



< E 



f-f. 



V A r 



where C depends on z, R,R, S and a. 



Proof. Note that $B(\hat f_{m^*}) \le \|\hat f_{m^*} - f\|^2$ since $f \in C_{m^*}$. The result follows by applying Theorem 6 and choosing $\lambda = m^*$ to bound the infimum from above. □

8. Proofs

8.1. Proof of Theorem 1. Throughout, $\kappa = z + 1/\sqrt{2}$. Hereafter, we fix some estimator $\hat s_\lambda$ and assume first that $\mathcal{E}(\hat s_\lambda) \ne \emptyset$. Therefore, there exists $\hat s_{\lambda'} \in \mathcal{E}(\hat s_\lambda)$ with $\hat s_{\lambda'} \ne \hat s_\lambda$. By using Proposition 2 with $r = (\hat s_\lambda + \hat s_{\lambda'})/2$ and the fact that $T(N, \hat s_\lambda, \hat s_{\lambda'}) \ge 0$, we get



H 2 (s,sy)-H 2 (s,s x ) 



p(Ma) ~\J x ^ 



dp 



p(s,Sy) - ^ J^Sydp 



and therefore, 



1 



= -T(iV, s x ,sy) +pen(s A ) - pen(sy) 

+ [P 0> «a) - Pr (sdfi, S X )} ~ [p 0, Sy) ~ Pr {sdp, Sy)} 

+ [p r (sdfl, S\) - Pr {sdp, S\)] - [p r {sdp, Sy) - p r (sdfl, Sy)} 

< _L [H 2 (s,s x ) + H 2 (s,sy)] 

+Z(N, sy sy) + pen(s A ) - pen(s A /) 



1 



\ - —jH 2 (s, sy) < (1 + -j=] H 2 (s, h)+Z(N, sy sy)+pen(s x )-pen(sy). 

For £ > 0, let us set 

y 2 (m, m, £) = t (d m V d m > + A m + A m / + £) , 

and 

Z(JV,t,f) 



n «= n { 



sup 

)G5 m xS, 



, w 2 (t,t',y(m,m',£)) 



<z\. 



Note that under Assumption 1, P(fi^) > 1 - 7£ 2 e~ 5 . On 

Z(JV,a A ,S A < ^ 2 (va) + ^ 2 (Ma') 

+zinf {y 2 (m,m',£), (m,m) G -M(s A ) x A4(s A ')} 
< ztf 2 (a,s A ) + zF 2 (s,s v ) 



+zr inf (eZ m + d m / + A m + A m > + £) 

(m,m')6A4(s,\)x^(sy ) 



and since for all A £ A, 



pen(s A ) > zt inf (d m + A m ) , 



we have

$(1 - \kappa)\,H^2(s, \hat s_{\lambda'}) \le (1 + \kappa)\,H^2(s, \hat s_\lambda) + 2\,\mathrm{pen}(\hat s_\lambda) + z\tau\xi$.

Let us set $a = (1 + \kappa)/(1 - \kappa)$. Since $\hat s_{\lambda'}$ is arbitrary among $\mathcal{E}(\hat s_\lambda)$, we deduce that on $\Omega_\xi$,

V(s x ) = sup H 2 (s x ,s y ) 

< (l + ^)H 2 ( S ,s x )+(l + ^-) sup H 2 (s,sy) 

^ i—\1tt1, ~ x 2(1 + Jo) . z(l + v/a) ,. 

< + g 2 (s,g A ) + ^=/pen( gA )+ ^_L_ ; rg. 

Note that this bounds is obviously true if £(s\) = since then V(s\) = 0. 
Now by using that 'D(s^) < T>(s\) + r, we obtain 

H 2 (s,s x ) < (2 + ^)H 2 (s,s x ) + ( k 1 + y^^) h2 (^^x) 

< (2 + y/Z) H 2 (s, s x ) + (l + YTTa) V V(h) ^ 

< (2 + ^fl a (.,^+(l + n ^)w+(l + n ^) 
(2 + y^) # 2 (s, s A ) + pen(s A ) + 



< (2 + y/a) 



z(2 + ^) 



< Ci(z) [tf 2 (a, J A ) + pen( SA )] + C 2 (z)t£ 
with k = z + 1/ \/2, a = (1 + /c)/(l — re) and 

(36) = (2 + ^)max|(2+ v ^), 7f |=+ z(i+ 1 ^_ ) } 



,37) - 

Finally, we conclude by using that P (fi^) > 1 — jT, 2 e~^ and the fact that 
s A is arbitrary. 

8.2. Proof of Lemma 1. Let $\xi > 0$ and

$y^2 \ge \tau\left[4\left(\eta_m^2 \vee \eta_{m'}^2\right) + \xi\right] \ge 4\tau\left(\eta_m^2 \vee \eta_{m'}^2\right)$.

We set $C = (S_m \cap B(s, y)) \times (S_{m'} \cap B(s, y))$ and for $j \ge 1$,

$C_j = \left\{(t, t') \in S_m \times S_{m'},\ 2^{j-1}y^2 < H^2(s, t) + H^2(s, t') \le 2^j y^2\right\}$.

Note that for all $j \ge 1$, $C_j \subset (S_m \cap B(s, 2^{j/2}y)) \times (S_{m'} \cap B(s, 2^{j/2}y))$ and that for $(t, t') \in C_j$, $w^2(t, t', y) = (H^2(s, t) + H^2(s, t')) \vee y^2 \ge (2^{j-1} \vee 1)y^2$.
By using Assumptions 2 and 3, we get
Z(N,t,t') 



sup 

(t,t>)eS m xS m , w 2 {t,t',y) 
< V[Z(N,t,t')>zy 2 ]+Y^ Yl ^[Z{N,t,l!)>z^-W 



> z 



(t,t')ec i>i (t,t')eCj 

< b\S m nB(s,y)\\S m ,nB(8,y)\exp 



1 4 

az y 



y 2 + czy 2 

+b £ \S m n B(s, y' 2 y)\ \S m > n B(s, y' 2 y)\ exp 



2-?y 2 + cz2J- V 



< 6M 2 exp 



< 6M 2 ^ ex P 

i>o 



r 1 + cz 

-yy 2 - 



+ 6M 2 ^ ex P 

i>i 



r 2(2 + cz) 



2 j y 2 



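recalling that $\tau = 4(2 + cz)/(az^2)$. By using that

$\tau^{-1}y^2 \ge 4\left(\eta_m^2 \vee \eta_{m'}^2\right) + \xi \ge 1 + \xi$

and the inequality $2^j \ge j + 1$ which holds for all $j \ge 0$, we finally obtain

$\mathbb{P}\left[\sup_{(t,t') \in S_m \times S_{m'}} \frac{Z(N, t, t')}{w^2(t, t', y)} > z\right] \le bM^2 \sum_{j \ge 0} \exp\left[-(j + 1)(1 + \xi)\right] \le bM^2 e^{-\xi}$.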
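
The last step of this proof rests on bounding a geometric series. Reading the final display as $\sum_{j \ge 0} \exp[-(j+1)(1+\xi)] \le e^{-\xi}$, the series equals $1/(e^{1+\xi} - 1)$, and the inequality amounts to $1 \le e^{\xi}(e - 1)$, which holds for all $\xi \ge 0$. The sketch below checks this numerically on a truncated sum; the function name and the truncation level are illustrative choices.

```python
import math


def tail_sum(xi, terms=200):
    """Truncated version of sum_{j >= 0} exp(-(j + 1) * (1 + xi))."""
    return sum(math.exp(-(j + 1) * (1.0 + xi)) for j in range(terms))


for xi in (0.0, 0.1, 1.0, 5.0):
    closed_form = 1.0 / (math.exp(1.0 + xi) - 1.0)
    # The truncated sum matches the closed form and stays below exp(-xi).
    assert abs(tail_sum(xi) - closed_form) < 1e-12
    assert tail_sum(xi) <= math.exp(-xi)
```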



8.3. The proof of Theorem 2. The proof relies on Theorem 1 and Lemma 1 . 

By definition of s, for all A £ A 

tf 2 (M) < 2F 2 (^)+2tf 2 (^,%) 

(38) < 2# 2 (s,s x ) + 2tf 2 (S x ,s A ) + 2r. 

By Lemma 1, the collection of models {S m , m € A4} satisfies Assumption 1, 
we can therefore apply Theorem 1 with the family of estimators {s x , A £ A} 
and get that with probability at least 1 — jY, 2 e~^ (with 7 = bM 2 ), 

H 2 (s,~s~ x ) < C(z)[H 2 ( S ,~sx) + pen(~ Sx )+T(Z + l)} 

< C(z) [2H 2 (s, s x ) + 2H 2 (s x , ~sx) + pen(5 A ) + r(£ + 1)] 
which with (38) and the fact that pen(s A ) > tz (A m > 1 for all m) leads to 
H 2 (s,S) < C'(z)[H 2 (s,s x )+A(s x ,S)+T((i + l)] 

which concludes the proof of the first part of Theorem 2. The second part follows by integration with respect to $\xi$.

8.4. Proof of Proposition 3.

Case of Examples 1 and 2. It suffices to prove the result in the case of 
Example 2, the result for Example 1 being obtained similarly by changing 
Z(N,t,t') into Z(N,t,t')/n. 

Note that for all t,t' G £ , 

n 

Z(N, t, f ) = J2 (V>(^ 4 Xi) - e tyftX ) 
1=1 

is a sum of independent and centered random variables bounded by y/2. 
Besides, by setting r{ = (U + t' i )/2 for z = 1, . . . ,n and using that for all 
Xi G A'j, V i-(xj) /ri(xi) < 2 we have 

2 



j=l v ' 
< 2> ; / IvX- Jt>) 1,/^-lYe^ 



2 £ /• V^) 2 ( ^ _ ^2 rf + 4i? 2 (t ^ 



i=l JXi 

< 8(H 2 (s,r)+H 2 (s,t) + H 2 (s,t')) . 

Since the concavity of $u \mapsto \sqrt{u}$ implies $2H^2(s,r) \le H^2(s,t) + H^2(s,t')$, we
obtain that, for $t,t' \in B(s,y)$,

$$\mathrm{Var}\left(Z(t,t',N)\right) \le \mathbb{E}\left[Z^2(t,t',N)\right] \le 3\left[H^2(s,t) + H^2(s,t')\right] \le 6y^2.$$

By applying Bernstein's inequality, we obtain that Assumption 2 is fulfilled
with $b = 1$, $a = 1/6$ and $c = \sqrt{2}/6$.

Case of Example 3. In this case,

$$Z(t,t',N) = \int_{\mathcal{X}} \psi(t,t',x)\,\left(dN(x) - s(x)\,d\mu\right)$$

where $\psi$ is bounded with values in $[-1/\sqrt{2}, 1/\sqrt{2}]$ and, by arguing as in
Section 8.4, it satisfies

$$\int_{\mathcal{X}} \psi^2(t,t',x)\, s(x)\, d\mu \le 3\left(H^2(s,t) + H^2(s,t')\right) \le 6y^2$$

for all $t,t' \in B(s,y)$. By applying Proposition 7 in Reynaud-Bouret (2003)
we obtain that $Z(N,t,t')$ satisfies (2) with $a = 1/12$, $b = 1$ and $c = \sqrt{2}/36$.
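8.5. Proof of Corollary 3. Since for all $m \in \mathcal{M}$, $\eta_m^2 = 2(D_m \vee 1/8)$,
Assumption 3 holds. We can therefore apply Theorem 2. Since for all
$\lambda \in \Lambda$, $S_{m(\lambda)}$ is an $\eta_{m(\lambda)}\sqrt{\tau}$-net for $\overline{S}_{m(\lambda)}$ and since equality holds in (15),

$$\begin{aligned}
A(\hat s_\lambda, S) &\le \tau\left((1 + 4z)\,\eta_{m(\lambda)}^2 + z\Delta_{m(\lambda)}\right) \\
&\le \tau\left(2(1 + 4z)\,(D_{m(\lambda)} \vee (1/8)) + z\Delta_{m(\lambda)}\right) \\
&\le \tau\left(2(1 + 4z)\,D_{m(\lambda)} + (1/4 + 2z)\,\Delta_{m(\lambda)}\right), \quad \forall \lambda \in \Lambda,
\end{aligned}$$

which leads to the result.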
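The last step of this chain rests on $D \vee (1/8) \le D + 1/8$ together with $\Delta \ge 1$. A small numerical sketch of that comparison follows; the function name, the grid of values, and the floating-point tolerance are illustrative assumptions, and the check presumes $D \ge 0$, $\Delta \ge 1$ and $z \in (0,1)$.

```python
def corollary3_bounds(D, Delta, z):
    """Return the first and last bracketed quantities of the chain (without the factor tau)."""
    eta_sq = 2.0 * max(D, 1.0 / 8.0)            # eta_m^2 = 2 (D_m v 1/8)
    first = (1.0 + 4.0 * z) * eta_sq + z * Delta
    last = 2.0 * (1.0 + 4.0 * z) * D + (0.25 + 2.0 * z) * Delta
    return first, last

# Compare the two quantities on a small grid (tolerance guards against rounding).
for D in (0.0, 0.05, 0.125, 1.0, 10.0):
    for Delta in (1.0, 2.5, 40.0):
        for z in (0.01, 0.5, 0.99):
            first, last = corollary3_bounds(D, Delta, z)
            assert first <= last + 1e-12
```

Equality is attained at $D = 0$ and $\Delta = 1$, which shows that the constant $1/4 + 2z$ cannot be lowered by this argument.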



8.6. Proofs of Propositions 5 and 6. 

Proof of Proposition 5. Take for m G M = A, S m = S m = {%} and 
note that Assumption 3 holds with M = 1 and r/ m = 0. By Lemma 1, 
Assumption 3 with d m = and 7 = 6 and the result follows by applying 
Corollary 2. 

Proof of Proposition 6. Inequality (23) follows from (22) and the fact that 
since S rn is a (M, rj m ^/r, D m )-net for S m 

# 2 (s,S m ) < 2# 2 (s,S m ) +2rr4 < 2H 2 (s, S m ) + 4r(,D m V A m ). 

Since Assumption 1 holds from Lemma 1, we obtain (22) by applying Corol- 
lary 2 with s\ = s\, noting that 

inf [H 2 {s, s x ) + pen(s A )] < inf [H 2 {s,S m ) + zt (4 V 2 m + A m )] . 



8.7. Proof of Theorem 3. It suffices to prove that Assumption 1 holds
with $d_m = \delta|m|$ and then to apply Corollary 2.

Let $\xi > 0$ and $y$ such that

$$y^2 \ge \tau\left(d_m \vee d_{m'} + \xi\right).$$

For $m,m' \in \mathcal{M}$, $t \in S_m$ and $t' \in S_{m'}$, $t$ and $t'$ are constant on each element
$I \in m \vee m'$ with value $t_I$, $t'_I$ respectively and therefore so is $\psi(t,t',\cdot)$:



1 

71 



1 + ij/i'j V 1 + 



Vx G /. 
Consequently, by using that $|\psi(t_I, t'_I)| \le 1/\sqrt{2}$ for all $I$ and the Cauchy-Schwarz
inequality,

Z(N,t,t') = ^(t/,tj)(JV(/)-E[JV(7)]) 

/SmVm' 
/fraVra' 
/EraVm' 

+2 e mx)V^w» (VW)-VWU)\) 



JEmVm' 



< 



<Y 2 (m V m') 
^ 2 (m V m') 



+ 2 



E V 2 (i/,^)E(AT(/)) 

JGmVm' 

1/2 



1/2 



#(m Vm') 



ip 2 (t,t ,x)sd[i 



X{m\J m!) 



By arguing as in Section 8.4, we have that

$$\int_{\mathcal{X}} \psi^2(t,t',x)\, s\, d\mu \le 3\left(H^2(s,t) + H^2(s,t')\right)$$

and thus, by using that $w^2(t,t',y) \ge y^2$ and $w^2(t,t',y) \ge (H^2(s,t) + H^2(s,t'))^{1/2}\, y$,
we derive



(t,t')eS m xS m/ w 2 (t,t',y) 

Since z £ (0, 1), 

Z(N,t,t') 



Z(N,t,t') < ^ 2 (mVm') +2 ^(mVm') 



< 



V2y 

2a/6 + 1 (X 2 (m\/m') %Vm') 



V2 V y 2 



(t,t')eS m xS m , w 2 (t,t',y) 



>z} C 



C 



^ 2 (m V m') #(m V m') > 2^/2 



^ 2 (m V m') 



> 



y 

2z 2 



2^/6 + 1 



(2^6 + 1)' 



and therefore 



Z(N,t,t') 
{t,t')eS m xS m , w 2 (t,t',y) 



< 



X 2 {m\J m!) > 



2„.2 



(2V6 + 1)' 



We conclude by using Assumption 4 with the fact that under Assumption 5, 



y 2 > r {d m V d m , + > 



{2V6+1Y 
2z 2 



a (|m V m'\ + £) . 
8.8. Proof of Proposition 8. Given an orthonormal basis $\{u_j, j = 1,\dots,D\}$,

consider the r/y^-net of V given by (28). Then, use Propositions 9 and 12 
in Birge (2006) (with vf = % (M',d) = (M", || ||), M = C, T = T and 
A = 1 = e) in order to build a subset T' of n^T" with the properties (8.15) 
and (8.16) given there. Finally, set $S = \phi^{-1}(T')$. The properties of $S$
derive from those of $T'$ given in this Proposition 12.



8.9. Proof of Proposition 9. In the sequel, $d(\cdot,\cdot)$ denotes the Euclidean
distance. By using Proposition 9 in Birgé (2006), $T$ is an $\eta$-net for $V$ satisfying
for all $s \in \mathbb{R}^n$ and $r \ge 2\eta$,

(39)    $|\{t \in T,\ d(s,t) \le r\}| \le \exp\left[0.458\, D \left(\frac{r}{\eta}\right)^2\right].$



Since V C V, for all v G V there exists t G T(ry) such that c?(u, t) < rj and 

d(v,T') < d(u,IIv*) < d(u,i) < r/ 
and T is therefore an 77-net for V". 

Let s G M^. Assume that {t' G T', d(s,t') < r} / 0. There exists t in T(r/) 
such that t' = iryt and d(Ilyt, s) < r. For such a t, 



d(s, t) < d(s, U v t) + d(n y t, t) < r + 77 



and therefore 



\{t'eT', d(s,t')<r}\ < \{t€T( V ), d(s,t)<r + V }\ 

< \{t G T, d(s,t) < r + r]}\ . 

Of course the above inequality also holds if $\{t' \in T',\ d(s,t') \le r\} = \emptyset$. By
using (39) and the fact that $r + \eta \le 1.5r$, we get

(40)    $|\{t' \in T',\ d(s,t') \le r\}| \le \exp\left[1.03125\, D \left(\frac{r}{\eta}\right)^2\right]$

and conclude since $\phi$ is an isometry from $(\mathbb{R}_+^n, H)$ into $(\mathbb{R}^n, d)$.
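The passage from the constant 0.458 in (39) to the constant 1.03125 in (40) is pure arithmetic: enlarging the radius from $r$ to $r + \eta \le 1.5r$ multiplies a bound that is quadratic in the radius by at most $1.5^2$. A minimal sketch of that computation, assuming the quadratic dependence on the radius and using illustrative names:

```python
BIRGE_CONSTANT = 0.458   # constant appearing in (39)
TARGET_CONSTANT = 1.03125  # constant appearing in (40)

def radius_inflation(r, eta):
    """Return (r + eta) / r, which is at most 1.5 whenever r >= 2 * eta > 0."""
    return (r + eta) / r

# Worst case r = 2 * eta gives the factor 1.5 exactly.
assert radius_inflation(2.0, 1.0) == 1.5

# A bound quadratic in the radius is inflated by at most 1.5 ** 2 = 2.25.
inflated = BIRGE_CONSTANT * 1.5 ** 2
assert inflated <= TARGET_CONSTANT
```

The product is about 1.0305, so 1.03125 is a slightly rounded-up value.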



8.10. Proof of Theorem 4. The proof is based on Proposition 6. Let
us first check that the assumptions of this proposition are fulfilled. We
already know from Propositions 8 and 9 that Assumption 3 holds. It
remains to check Assumption 2. Under Assumption 6, we have for all
$u = (u_1,\dots,u_n) \in \mathbb{R}^n$ such that $\sum_{i=1}^n u_i^2 s_i \le v^2$ and $\max_{i=1,\dots,n} |u_i| \le \gamma$,
and all $\lambda \in (-1/(\beta\gamma), 1/(\beta\gamma))$,



E 



i=i 

n 



< JJexp 

i=i 

< exp 



\ 2 aufsi 
2(l-|A| 7 /3)_ 

AW 



In particular, for all A G (0, l/(/3 7 )), 



(41) 



E 



=*£i=i«iPQ-^) 



< exp 



2(1-|A| 7 /3)J 



aW 



_2(l-A 7 /3). 

Under (41), we derive from Bernstein's inequality (see Massart (2007),
Corollary 2.10),



(42) 



< exp 



For $t,t' \in B(s,y) \subset \mathbb{R}_+^n$, let us now take $u = (\psi(t,t',1),\dots,\psi(t,t',n))$ (where
$\psi$ is defined by (8) on $\mathcal{X} = \{1,\dots,n\}$) and note that

$$\sum_{i=1}^n \psi(t,t',i)\,(X_i - s_i) = Z(N,t,t') \quad\text{and}\quad \max_{i=1,\dots,n} |\psi(t,t',i)| \le \frac{1}{\sqrt{2}}.$$

Besides, by arguing as in Section 8.4,

$$\sum_{i=1}^n u_i^2 s_i = \sum_{i=1}^n \psi^2(t,t',i)\, s_i \le 3H^2(s,t) + 3H^2(s,t') \le 6y^2 = v^2.$$



Consequently, we deduce from (42) that Assumption 2 is satisfied with $a =
1/(12\sigma)$, $b = 1$ and $c = \beta\sqrt{2}/(24\sigma)$ (then $\tau \le 96 z^{-2}(\sigma + \beta)$). By applying
Proposition 6, we obtain (22) from which we deduce Theorem 4 since
for the Discretizations P1 and P2, the $S_m$ satisfy

$$H^2(s, S_m) \le 16 H^2(s, \overline{S}_m) + 2\tau\eta_m^2, \quad \forall m \in \mathcal{M}.$$
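The parenthetical bound on $\tau$ can be checked directly from $\tau = 4(2 + cz)/(az^2)$, which gives $\tau z^2 = 96\sigma + 2\sqrt{2}\,\beta z$ for the constants above. The sketch below evaluates this on a few values; the function names and the grid are illustrative, and $z \in (0,1]$ with $\sigma, \beta > 0$ is assumed.

```python
import math

def tau(a, c, z):
    """tau = 4 (2 + c z) / (a z^2), as recalled in the proof of Lemma 1."""
    return 4.0 * (2.0 + c * z) / (a * z * z)

def tau_theorem4(sigma, beta, z):
    """tau for a = 1/(12 sigma) and c = beta sqrt(2) / (24 sigma)."""
    a = 1.0 / (12.0 * sigma)
    c = beta * math.sqrt(2.0) / (24.0 * sigma)
    return tau(a, c, z)

# Check tau <= 96 (sigma + beta) / z^2 on a small grid.
for sigma in (0.1, 1.0, 7.0):
    for beta in (0.1, 1.0, 7.0):
        for z in (0.05, 0.5, 1.0):
            assert tau_theorem4(sigma, beta, z) <= 96.0 * (sigma + beta) / z ** 2
```

The inequality is far from tight in $\beta$, since $2\sqrt{2}\,\beta z \le 96\beta$ with a wide margin.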



8.11. Proof of Corollary 6. Result (ii) is direct from Theorem 4. For
(i), take $\mathcal{M} = \{m\}$, $\Delta_m = 1$ and $S_m$ a discretization of $\overline{S}_m$ obtained
by P1 or P2. Then, the result follows by applying Theorem 4 denoting $\tilde s$
by $\tilde s_m$. For (iii), consider the collection of models $\overline{S}_m$ described to handle
Problem 1 and discretize them by applying P1. Apply Theorem 4 and take
F as any element of V™ such that \/5 = (F(xi), . . . , F(x n )). We obtain that



for all F € B« )00 



E [n _1 # 2 (s, 5)] =E 



1 n 



J>0 



< C inf <^ R 2 2- 2Ja + 



n 



and the result follows by optimizing with respect to J. 

8.12. Proof of Theorem 5. In the sequel, $\rho(P,Q)$ and $h(P,Q)$ denote the
Hellinger affinity and the Hellinger distance between the probabilities $P$, $Q$.
For $\theta \in \Theta^n$, $A'(\theta)$ corresponds to the vector $t = (A'(\theta_1),\dots,A'(\theta_n))$. We
start with the following lemma.

Lemma 2. Assume that Assumptions 7 and 8 hold. For all $\theta,\theta' \in \Theta^n$,
$t = A'(\theta)$ and $t' = A'(\theta')$, we have,

$$h^2(P_\theta, P_{\theta'}) \le -\sum_{i=1}^n \log \rho(P_{\theta_i}, P_{\theta'_i}) \le 4\kappa H^2(t,t').$$



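Proof. Since

$$h^2(P_\theta, P_{\theta'}) = 1 - \rho(P_\theta, P_{\theta'}) = 1 - \exp\left[\sum_{i=1}^n \log \rho(P_{\theta_i}, P_{\theta'_i})\right] \le -\sum_{i=1}^n \log \rho(P_{\theta_i}, P_{\theta'_i}),$$

it suffices to show that

$$-\sum_{i=1}^n \log \rho(P_{\theta_i}, P_{\theta'_i}) \le 4\kappa H^2(t,t') = 4\kappa \sum_{i=1}^n H^2(t_i, t'_i).$$

By summing over $i$, it is enough to show the inequality for $n = 1$, which we
shall do. Let $\theta, \theta'$ in $\Theta$ such that $t = A'(\theta)$ and $t' = A'(\theta')$. With no loss of
generality, we may assume that $\theta' < \theta$ and set $\delta = (\theta - \theta')/2$. The Hellinger
affinity between $P_\theta$ and $P_{\theta'}$ is given by

$$\rho(P_\theta, P_{\theta'}) = \exp\left[-\left(\frac{A(\theta) + A(\theta')}{2} - A\left(\frac{\theta + \theta'}{2}\right)\right)\right]$$

and therefore

$$\begin{aligned}
-\log \rho(P_\theta, P_{\theta'}) &= \frac{A(\theta) + A(\theta')}{2} - A\left(\frac{\theta + \theta'}{2}\right) \\
&= \frac{1}{2}\left[A(\theta) + A(\theta - 2\delta) - 2A(\theta - \delta)\right] \\
&= \frac{1}{2} \int_{\theta - \delta}^{\theta} \left(A'(u) - A'(u - \delta)\right) du \\
&= \frac{1}{2} \int_{\theta - \delta}^{\theta} \int_{u - \delta}^{u} A''(v)\, dv\, du.
\end{aligned}$$
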
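Since $t,t' \in \mathbb{R}_+ \setminus \{0\}$ and since under Assumption 8, $A'$, $A''$ do not vanish
on $[\theta', \theta]$, for all $u \in [\theta - \delta, \theta]$ and $v \in [u - \delta, u]$

$$\begin{aligned}
A''(v) &= \frac{A''(v)}{2\sqrt{A'(v)}}\, \frac{A''(u)}{2\sqrt{A'(u)}}\, \frac{4\sqrt{A'(v)A'(u)}}{A''(u)} \\
&\le \frac{A''(v)}{2\sqrt{A'(v)}}\, \frac{A''(u)}{2\sqrt{A'(u)}}\, \frac{4A'(u)}{A''(u)} \\
&\le 4\kappa\, \frac{A''(v)}{2\sqrt{A'(v)}}\, \frac{A''(u)}{2\sqrt{A'(u)}}
\end{aligned}$$

giving thus,

$$\begin{aligned}
-\log \rho(P_\theta, P_{\theta'}) &\le 2\kappa \int_{\theta - \delta}^{\theta} \int_{u - \delta}^{u} \frac{A''(v)}{2\sqrt{A'(v)}}\, \frac{A''(u)}{2\sqrt{A'(u)}}\, dv\, du \\
&\le 2\kappa \int_{\theta'}^{\theta} \int_{\theta'}^{\theta} \frac{A''(v)}{2\sqrt{A'(v)}}\, \frac{A''(u)}{2\sqrt{A'(u)}}\, dv\, du \\
&= 2\kappa \left(\int_{\theta'}^{\theta} \frac{A''(v)}{2\sqrt{A'(v)}}\, dv\right)^2 \\
&= 2\kappa \left(\sqrt{A'(\theta)} - \sqrt{A'(\theta')}\right)^2 \\
&= 2\kappa \left(\sqrt{t} - \sqrt{t'}\right)^2.
\end{aligned}$$

□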
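As an illustration of the affinity formula used in this proof, consider the Poisson family, taken here as an assumed example: with natural parameter $\theta = \log t$ one has $A(\theta) = e^\theta$ and $A'(\theta) = t$. The sketch below compares a direct summation of the Hellinger affinity with the closed form $(A(\theta) + A(\theta'))/2 - A((\theta + \theta')/2)$, and checks $1 - \rho \le -\log \rho$. The function names and the truncation level `kmax` are illustrative.

```python
import math

def poisson_affinity(t, t_prime, kmax=400):
    """Hellinger affinity sum_k sqrt(p_t(k) p_t'(k)) of two Poisson laws, by direct summation."""
    total = 0.0
    for k in range(kmax):
        log_p = -t + k * math.log(t) - math.lgamma(k + 1)
        log_q = -t_prime + k * math.log(t_prime) - math.lgamma(k + 1)
        total += math.exp(0.5 * (log_p + log_q))
    return total

def neg_log_affinity(theta, theta_prime, A=math.exp):
    """(A(theta) + A(theta')) / 2 - A((theta + theta') / 2) for a log-partition function A."""
    return 0.5 * (A(theta) + A(theta_prime)) - A(0.5 * (theta + theta_prime))

t, t_prime = 2.0, 5.0
rho = poisson_affinity(t, t_prime)

# Direct summation agrees with the exponential-family formula.
assert abs(-math.log(rho) - neg_log_affinity(math.log(t), math.log(t_prime))) < 1e-9

# h^2 = 1 - rho is bounded by -log(rho).
assert 1.0 - rho <= -math.log(rho)
```

In this example $-\log \rho = (\sqrt{t} - \sqrt{t'})^2/2$, which is indeed of the form controlled by the final display of the proof.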



The proof of Theorem 5 is based on Assouad's Lemma (see Assouad (1983)),
more precisely on the version given by Theorem 2.10 in Tsybakov (2004).
In the sequel, $u_1,\dots,u_D$ denote an orthonormal basis of $V$ and $d(\varepsilon,\varepsilon')$ the
Hamming distance between two elements $\varepsilon$ and $\varepsilon'$ of $\{0,1\}^D$, that is

$$d(\varepsilon,\varepsilon') = \sum_{j=1}^D \mathbf{1}_{\varepsilon_j \ne \varepsilon'_j}.$$
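For concreteness, the Hamming distance on $\{0,1\}^D$ can be computed as follows. This is a minimal sketch; the function name and the length check are illustrative choices.

```python
def hamming(eps, eps_prime):
    """Number of coordinates at which two binary vectors of equal length differ."""
    if len(eps) != len(eps_prime):
        raise ValueError("both vectors must have the same length D")
    return sum(1 for a, b in zip(eps, eps_prime) if a != b)

# Two vertices of the hypercube {0,1}^4 differing in the first and third coordinates.
assert hamming((0, 1, 1, 0), (1, 1, 0, 0)) == 2
```

Vertices at Hamming distance 1 are the neighbouring pairs to which the condition of Assouad's Lemma is applied.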

Let r G 1Z. There exists t° £ S such that the Euclidean ball (of V) centered 
at no = with radius r is contained in C. Consequently, there exists 
j3jj such that VtP = Ylf=i Pj u j an d that for all e £ {0, 1} D one can 



find t £ £ S such that 



D 



3=1 

Note that for all $\varepsilon, \varepsilon' \in \{0,1\}^D$,

$$2H^2(t^\varepsilon, t^{\varepsilon'}) = r^2\, d(\varepsilon,\varepsilon').$$
infsupE s [H 2 (s,s)] > inf sup _E t e [H 2 (t £ , s)]

P 2 (t £ A £ 



r 

~2 ,: 



Besides, 



> inf sup E^ 

£ ee{0,ip 

■ inf sup E t £ [<i 2 (e, £)] , 
" ee{0,i} 5 

where the two last infima run among all estimators $\hat\varepsilon$ based on the observations
$(X_1,\dots,X_n)$ with values in $\{0,1\}^D$. Theorem 2.10 in Tsybakov (2004)
asserts that

inf sup _E t s [d 2 (e, e)] > — (l — ^/a(2 - a)) 
£ £ g{o,i} d 

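provided that for all $\varepsilon,\varepsilon'$ such that $d(\varepsilon,\varepsilon') = 1$,

$$h^2(P_{\theta^\varepsilon}, P_{\theta^{\varepsilon'}}) \le \alpha < 1,$$

where $\theta^\varepsilon$ and $\theta^{\varepsilon'}$ correspond to the parameters in $\Theta^n$ associated to $t^\varepsilon$ and $t^{\varepsilon'}$
respectively. By taking $\alpha = 1/2$ and using Lemma 2, for all $\varepsilon,\varepsilon'$ such that
$d(\varepsilon,\varepsilon') = 1$

$$h^2(P_{\theta^\varepsilon}, P_{\theta^{\varepsilon'}}) \le 4\kappa H^2(t^\varepsilon, t^{\varepsilon'}) \le 2\kappa r^2 \le \frac{1}{2} = \alpha.$$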
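With $\alpha = 1/2$, the factor $1 - \sqrt{\alpha(2 - \alpha)}$ appearing in the Assouad bound evaluates to $1 - \sqrt{3}/2 \approx 0.134$, and the requirement $2\kappa r^2 \le 1/2$ amounts to $r^2 \le 1/(4\kappa)$. A short sketch of both computations, with illustrative function names:

```python
import math

def assouad_factor(alpha):
    """Value of 1 - sqrt(alpha (2 - alpha)) for 0 <= alpha < 1."""
    return 1.0 - math.sqrt(alpha * (2.0 - alpha))

def max_r_squared(kappa, alpha=0.5):
    """Largest r^2 compatible with 2 kappa r^2 <= alpha."""
    return alpha / (2.0 * kappa)

# alpha = 1/2 gives 1 - sqrt(3)/2.
assert abs(assouad_factor(0.5) - (1.0 - math.sqrt(3.0) / 2.0)) < 1e-15

# For kappa = 1 the admissible radius satisfies r^2 <= 1/4.
assert max_r_squared(1.0) == 0.25
```

The factor is strictly positive for every $\alpha < 1$, which is what makes the resulting lower bound non-trivial.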

Therefore, 

inf sup E s \H 2 (s,s)] > 1- ^ /2 Dr 2 , 
which concludes the proof since $r$ is arbitrary in $\mathcal{R}$.



8.13. Proof of Theorem 6. The proof is based on Theorem 2. Let us first 
check that the assumptions of this theorem hold. The marginal of X being 
given by s = (//, we already know from Proposition 3 that Assumption 2 
holds true for Example 2 with $a = 1/6$, $b = 1$ and $c = \sqrt{2}/36$ (which leads
to the value $\tau = 50z^{-2}$). In order to check Assumption 3, we distinguish
between Collections (C1) and (C2).

Case of Collection (CI). For any m G M, by using Propositions 9 and 12 
in Birge (2006) with vf = % (M',d) = (M n , || ||), M = C, A = 1 = e and 
T = 7m where 7^ = 7" is given by (28) as a discretization of the linear space 
V m , we obtain from the Discretization PI a discretized subset 7^ of H^V m 
satisfying the properties (8.15) and (8.16) given Birge (2006), that is for all 
g G W 1 and r > fj m /2 

\{t€T^, \\g-t\\ <ry^}| <exp 



\VmJ 
and d(g,T^ n ) < 4d(g,T m ) (where d(.,.) denotes the Euclidean distance).
Since for all g,g' G C, H 2 (q g ,q g >) > R 2 \\g — g'\\ 2 , for all g € C and x > 
2 Vm = Rr] m /2 ' 



\S m nB(q g ,x^)\ < \{geT m , Wa-g'W < BT 1 ^}] 

< exp 



4.2R~ 2 D r ■' " 



< exp 



D m ( x 



3 V 7 ? 



Consequently, since r] m = 2D m /3 for all m, the family {S m , m £ Ai] satis- 
fies Assumption 3 with M = 1. 

Case of Collection (C2). By using (40) and arguing as in the previous case, 
for all g G W 1 and x > 2r] m = 2Rr] m , 

\S m nB(g g ,xV?)\ < \{g'eT^, \\g-g'\\ <£ _ W^}| 

1.031i?" 2 ^ r 



< exp 

< exp 



imiD r 



x 



X 

Vm 
2 



and we deduce similarly that the family $\{S_m, m \in \mathcal{M}\}$ satisfies Assumption 3
with $M = 1$.

Let us now finish the proof of Theorem 6. Since in both cases, Assumption 3 
holds, we can apply Theorem 2 and get that 



for some C depending on X, z only. Under Assumption 9, we derive that



E 



J? 















inf |e 




f-fx 


2 + inf 


inf 




AeA [ 









+ tR 2 (D m VA ra ) 
In the case of Collection (C1), we conclude by using that for $m = m(\lambda)$



inf I inf 

meM \teT> 



fx-t 

2 



+ tR 2 (D m VA m ) 

|2 



< 


2 


/- 


fx 


< 


2 


f- 


fx 


< 


2 


f- 


fx 



+ 2 inf \\f-t\\ 2 + rR (D m VA m ) 



+ 32 inf \\f-t\\ 2 + TR 2 (D m V A m ) 
+ 64 inf ||/-t|| 2 + 64rC + r^ 2 (^ m VA m ) 
2 + r (64 x lir 2 + RT 2 ) (D m V A m ) . 



< 66 



f-fx 



For collection C2, we conclude by using that 7^ is a ?7 m y / T-net for C m and 
that for m = m(A), 

2 



inf inf 



+ tR 2 (D m VA ra ) 



< 2 



A - % m A + 2rC + tR (D m V A m ) 